pith. machine review for the scientific record.

arxiv: 2604.22868 · v1 · submitted 2026-04-23 · 💻 cs.CV · cs.AI

Recognition: unknown

Probing Visual Planning in Image Editing Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual planning · image editing models · abstract puzzles · maze navigation · queen placement · single-step transformation · generalization · neural reasoning gap

The pith

Reformulating visual planning as single-step image editing lets models generalize from small abstract puzzles to larger and new ones, though they still lag human efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that visual planning can be isolated and tested using abstract spatial puzzles instead of language-based approaches. It introduces a paradigm that completes the entire planning process through one image transformation rather than iterative generation steps. Evaluation on procedurally generated maze and queen placement problems shows that leading editing models fail when given no examples but acquire broad capabilities after exposure to basic cases. This capability transfers to harder versions and unfamiliar layouts, yet the fastest model still requires more time than people solving the same problems from scratch.
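To make the single-step framing concrete, here is a minimal sketch of one planning episode under this paradigm. The `edit_image` stub, the file paths, and the instruction string are illustrative assumptions, not the paper's actual interface:

```python
# Minimal sketch of the editing-as-reasoning loop: one image in, one
# solved image out, with no intermediate generation steps.
from PIL import Image

def edit_image(img: Image.Image, instruction: str) -> Image.Image:
    # Stand-in stub: a real autoregressive or diffusion editing model
    # would be invoked here with the puzzle image and the instruction.
    return img.copy()

def solve_puzzle(puzzle_png: str, out_png: str) -> Image.Image:
    """One planning episode is a single edit, not an iterative rollout."""
    puzzle = Image.open(puzzle_png).convert("RGB")
    instruction = "Draw the complete solution path from start to goal."
    solution = edit_image(puzzle, instruction)  # the whole plan, in one call
    solution.save(out_png)
    return solution
```

The point of the reformulation is that the loop body contains exactly one model call; there is no rollout whose length grows with puzzle size.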

Core claim

The editing-as-reasoning paradigm reformulates visual planning tasks as single-step image transformations. When tested on the AMAZE dataset of maze navigation and queen placement puzzles, leading image editing models struggle in zero-shot settings. However, finetuning on basic-scale puzzles yields strong generalization to larger in-domain scales as well as out-of-domain scales and geometries. The best-performing model still fails to match the zero-shot efficiency of human solvers, indicating a gap in neural visual reasoning.

What carries the argument

The editing-as-reasoning paradigm, which recasts visual planning as a single image-edit operation, evaluated on the AMAZE dataset of Maze and Queen problems using pixel-wise fidelity and logical-validity metrics.
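A sketch of what the two automatic checks could look like for the maze task, assuming a fixed path color, a pixel tolerance, and precomputed boolean solution and wall masks; the paper's exact metric definitions are not reproduced here:

```python
import numpy as np

def path_mask(img: np.ndarray, color=(255, 0, 0), tol: int = 40) -> np.ndarray:
    """Boolean mask of pixels within `tol` of the assumed path color."""
    return np.abs(img.astype(int) - np.array(color)).sum(axis=-1) <= tol

def pixel_fidelity(gen: np.ndarray, solution: np.ndarray) -> float:
    """Fraction of drawn path pixels overlapping the true solution mask
    (the 'yellow' overlap in Figure 2); the rest falls on non-solution
    area (the 'red' overlap)."""
    drawn = path_mask(gen)
    return float((drawn & solution).sum() / max(int(drawn.sum()), 1))

def logically_valid(gen: np.ndarray, solution: np.ndarray,
                    walls: np.ndarray) -> bool:
    """Stricter binary check: the drawn path must cover the whole
    solution and touch no wall pixel."""
    drawn = path_mask(gen)
    return bool((drawn & solution).sum() == solution.sum()
                and not (drawn & walls).any())
```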

If this is right

  • Finetuning on basic scales produces generalization to larger in-domain scales.
  • The same finetuning extends to out-of-domain scales and geometries.
  • Automatic evaluation becomes feasible through pixel-wise fidelity and logical validity on abstract puzzles.
  • A persistent efficiency gap remains between the best neural model and human zero-shot solvers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training image models on transformation tasks may be sufficient to induce planning behavior without separate reasoning modules.
  • Abstract puzzle suites could serve as diagnostic benchmarks for planning deficits across other vision architectures.
  • Closing the remaining speed gap would likely require architectural changes that reduce the cost of each editing step.

Load-bearing premise

Abstract puzzles such as Maze and Queen problems, together with pixel-wise and logical validity metrics, sufficiently isolate and measure intrinsic visual planning separate from visual recognition.

What would settle it

A model trained only on small-scale mazes that then fails to produce valid solutions on larger mazes or new geometries in the same test set would falsify the generalization result; conversely, any model achieving human-level zero-shot solving speed on the full AMAZE suite would falsify the efficiency gap claim.

Figures

Figures reproduced from arXiv: 2604.22868 by Bo Zhao, Qiuyu Liao, Xiaojian Ma, Yanpeng Zhao, Zhimu Zhou.

Figure 1: The AMAZE tasks.
Figure 2: Overview of EAR. Left: the EAR paradigm. Right: automatic evaluation. Yellow and red highlight the generated image's overlap with the solution and non-solution areas, respectively.
Figure 3: Solutions from different denoising steps.
Figure 4: Zero-shot generalization. Left: PASS@5 matrix for 3×3 models. Right: comparison between 3×3 and 8×8 training; fine-tuning on a single geometry yields the best generalization across the other geometry types.
Figure 5: Generalization across scales for Maze (top) and Queen (bottom).
Figure 6: Data scaling. Effect of training-set size N ∈ {800, 1600, 3200, 6400} under a fixed compute budget of 1000 training steps; scaling up data initially yields slight improvements on all tasks, but gains become marginal beyond N = 1600.
Figure 7: Compute scaling. Training duration doubled from 500 to 1000 steps (2.5 to 5 epochs) on a fixed set of 6400 samples; scaling compute yields consistent improvements except for slight drops on Maze at step 800 and on Queen at step 700, with generally marginal gains over steps 500–700.
Figure 8: Examples of failure modes in Maze (first two rows) and Queen (last row).
Figure 9: Success rates of humans and Bagel under different time budgets.
Figure 10: Correlation between model and human …
Figure 11: Data scaling on cross-domain performance. Models trained on single 8×8 geometries with fixed steps (500) are evaluated on all geometry types across scales from 3×3 to 16×16.
Figure 12: Joint scaling of data and compute.
Figure 13: Fatal cases for □ and △ mazes. Left: boundary violation; right: incomplete paths.
Original abstract

Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that they all struggle in the zero-shot setting, while finetuning on basic scales enables remarkable generalization to larger in-domain scales and out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.
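The abstract's "procedurally generated" claim is easy to make concrete. Below is a standard recursive-backtracker maze generator, shown purely as an illustration of how instances with known ground-truth solutions can be emitted at arbitrary scales; AMAZE's actual generator may differ:

```python
import random

def generate_maze(n: int, seed: int = 0):
    """Carve an n x n perfect maze by iterative depth-first backtracking.
    Returns the set of open passages as frozensets of adjacent cells."""
    rng = random.Random(seed)
    passages, seen, stack = set(), {(0, 0)}, [(0, 0)]
    while stack:
        x, y = stack[-1]
        nbrs = [(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < n and 0 <= y + dy < n
                and (x + dx, y + dy) not in seen]
        if not nbrs:
            stack.pop()          # dead end: backtrack
            continue
        nxt = rng.choice(nbrs)
        passages.add(frozenset((stack[-1], nxt)))  # knock down the wall
        seen.add(nxt)
        stack.append(nxt)
    return passages
```

Because a perfect maze has exactly one path between any two cells, the ground truth needed for the validity metrics comes for free at generation time.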

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the EAR (editing-as-reasoning) paradigm that reformulates visual planning tasks as single-step image editing rather than iterative generation. It presents the AMAZE dataset of procedurally generated Maze and Queen puzzles to probe image editing models, using pixel-wise fidelity and logical validity metrics for automatic evaluation. The central claims are that leading proprietary and open-source editing models fail in zero-shot settings, that fine-tuning on basic scales yields strong generalization to larger in-domain scales and out-of-domain geometries, and that even the best model on high-end hardware cannot match the zero-shot efficiency of human solvers.

Significance. If the empirical results hold and the tasks genuinely isolate planning, the work would usefully document limitations of current editing models on visual reasoning and demonstrate the value of the editing-as-reasoning reformulation. The automatic evaluation protocol for abstract puzzles is a constructive contribution. However, the significance is tempered by the absence of evidence that the chosen metrics and puzzles require constructing or searching solution paths rather than permitting low-level shortcuts.

major comments (2)
  1. [Abstract] The claim that AMAZE 'facilitates automatic evaluation' and isolates 'intrinsic reasoning from visual recognition' is load-bearing for the zero-shot failure, generalization, and human-gap conclusions, yet no validation is supplied that logical validity cannot be satisfied by color-based heuristics, direct pattern completion, or dataset regularities that bypass path construction or search.
  2. [Abstract] The efficiency comparison (best model on high-end hardware vs. human zero-shot solvers) is central to the final claim of a 'persistent gap in neural visual reasoning,' but the manuscript provides no quantitative details on how human solving time or step count is measured, how model inference latency is recorded, or what controls are applied for hardware normalization.
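A reproducible latency protocol of the kind major comment 2 asks for would at least pin down warmup handling and the summary statistic reported. A minimal harness, with all names hypothetical, might look like:

```python
import statistics
import time

def median_latency(solve, puzzles, warmup: int = 3) -> float:
    """Median wall-clock seconds per puzzle for a `solve` callable."""
    for p in puzzles[:warmup]:
        solve(p)  # discard warmup runs (caching, compilation, lazy I/O)
    times = []
    for p in puzzles[warmup:]:
        t0 = time.perf_counter()
        solve(p)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)
```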
minor comments (1)
  1. [Abstract] The acronym EAR is expanded only once in the abstract; subsequent uses should either repeat the expansion or define it explicitly in the introduction for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments point by point below, indicating where we will revise the manuscript to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] The claim that AMAZE 'facilitates automatic evaluation' and isolates 'intrinsic reasoning from visual recognition' is load-bearing for the zero-shot failure, generalization, and human-gap conclusions, yet no validation is supplied that logical validity cannot be satisfied by color-based heuristics, direct pattern completion, or dataset regularities that bypass path construction or search.

    Authors: We agree that explicit checks against low-level shortcuts would strengthen the isolation claim. The procedural generation of AMAZE varies maze sizes, queen counts, and board geometries precisely to reduce dataset regularities, and logical validity is defined to require complete, attack-free solutions (for queens) or connected paths from start to goal (for mazes); a minimal form of this check is sketched after these responses. Zero-shot model failures even on the smallest instances suggest that simple color or pattern heuristics are not being exploited. Nevertheless, to address the concern directly, the revision will add an ablation subsection that evaluates heuristic baselines (color-matching, template completion) and reports their logical-validity rates on held-out scales and geometries, confirming that these shortcuts do not suffice. revision: yes

  2. Referee: [Abstract] The efficiency comparison (best model on high-end hardware vs. human zero-shot solvers) is central to the final claim of a 'persistent gap in neural visual reasoning,' but the manuscript provides no quantitative details on how human solving time or step count is measured, how model inference latency is recorded, or what controls are applied for hardware normalization.

    Authors: We acknowledge that the current description of the efficiency comparison lacks the necessary methodological detail for reproducibility and fair interpretation. In the revised manuscript we will expand the human-study and runtime sections to specify: (i) the exact protocol used to record human solving time and step count (including instructions given to participants and any time limits), (ii) the hardware and software stack on which model inference latency was measured, and (iii) any normalization steps applied to place human and model times on a comparable footing. These additions will allow readers to assess the reported gap with full context. revision: yes
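The logical-validity definition invoked in response 1 is mechanically checkable once queen positions are decoded from the generated image (decoding itself is assumed here). A minimal sketch:

```python
def queens_valid(positions, n: int) -> bool:
    """True iff `positions` is a complete, mutually non-attacking
    placement of n queens on an n x n board, given as (row, col) pairs."""
    if len(positions) != n or len(set(positions)) != n:
        return False
    if not all(0 <= r < n and 0 <= c < n for r, c in positions):
        return False
    rows = {r for r, _ in positions}
    cols = {c for _, c in positions}
    diag = {r - c for r, c in positions}   # "\" diagonals
    anti = {r + c for r, c in positions}   # "/" diagonals
    # Any shared row, column, or diagonal means two queens attack.
    return len(rows) == len(cols) == len(diag) == len(anti) == n

# The classic 4-queens solution passes; shifting one queen fails.
assert queens_valid([(0, 1), (1, 3), (2, 0), (3, 2)], 4)
assert not queens_valid([(0, 0), (1, 3), (2, 0), (3, 2)], 4)
```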

Circularity Check

0 steps flagged

No circularity: empirical results on a new dataset and tasks

Full rationale

The paper introduces the EAR paradigm and the AMAZE dataset of Maze/Queen puzzles, then reports empirical zero-shot and finetuning results for editing models using pixel-wise fidelity and logical-validity metrics. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. Claims rest on direct model evaluations rather than any derivation that reduces to its own inputs by construction. The assumption that these tasks isolate planning is a methodological choice open to external scrutiny, but it does not create circularity in the reported findings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the introduced EAR paradigm and AMAZE dataset as valid probes for visual planning, plus the assumption that automatic pixel and logical metrics capture reasoning ability.

invented entities (2)
  • EAR (no independent evidence)
    purpose: editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation
    Introduced to avoid the computational cost of step-by-step planning-by-generation.
  • AMAZE (no independent evidence)
    purpose: procedurally generated dataset of abstract Maze and Queen puzzles for automatic evaluation of visual planning
    Created to isolate reasoning from recognition and enable scalable testing.

pith-pipeline@v0.9.0 · 5518 in / 1272 out tokens · 60399 ms · 2026-05-09T21:36:36.419356+00:00 · methodology

