PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

He Zhu; Junyou Su; Minxin Chen; Wenjia Zhang; Wen Wang; Yijie Deng

arxiv: 2606.05744 · v1 · pith:YPCLSIQSnew · submitted 2026-06-04 · 💻 cs.CL

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

Minxin Chen , He Zhu , Junyou Su , Wen Wang , Yijie Deng , Wenjia Zhang This is my paper

Pith reviewed 2026-06-28 02:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords spatial planningvision language modelsbenchmarkurban planningmultimodal evaluationplanning mapscognitive pipeline

0 comments

The pith

VLMs show 27% improvement on spatial planning map tasks but still fail at implementation requiring policy judgment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes PlanBench-V as a benchmark to test how well vision-language models can interpret spatial planning maps used in territorial governance. It creates a dataset of 223 maps with 1629 expert-curated question-answer pairs and evaluates models across four stages of capability: perceiving the map, reasoning about it, associating with policy, and implementing decisions. Results indicate that models from 2026 outperform 2025 models substantially yet all have trouble with the final stage that demands evaluative judgment and constraint awareness. This matters for anyone relying on AI to assist in urban planning and public decision processes where accurate map reading directly affects outcomes.

Core claim

PlanBench-V is introduced as the first comprehensive benchmark for VLMs on spatial planning map interpretation, featuring the Spatial Planning Map Database with expert annotations and a four-stage evaluation framework of Perception, Reasoning, Association, and Implementation. Experiments demonstrate clear progress between model generations but persistent limitations, particularly in implementation tasks that require policy sensitivity and constraint-aware decision-making.

What carries the argument

The theory-informed evaluation framework that breaks down planning map interpretation into four progressive capabilities: Perception, Reasoning, Association, and Implementation.

If this is right

Current VLMs have made measurable gains in handling spatial planning maps over one year.
All tested models continue to underperform on tasks needing professional judgment and policy knowledge.
Domain-specific benchmarks like this expose gaps not visible in general visual understanding tests.
Future development of VLMs for professional use will require targeted improvements in evaluative and constraint-handling abilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying this benchmark to train models could lead to better AI tools for planners.
Similar benchmarks might be developed for other specialized visual domains like architecture or environmental science.
Human performance baselines on the same questions would help calibrate how close models are to expert level.

Load-bearing premise

That the division into four capabilities accurately reflects the real cognitive process of interpreting planning maps and that the expert annotations represent correct answers for model evaluation.

What would settle it

If professional planners achieve significantly lower accuracy than the top models on the implementation questions, or if models match expert performance across all four capability levels.

Figures

Figures reproduced from arXiv: 2606.05744 by He Zhu, Junyou Su, Minxin Chen, Wenjia Zhang, Wen Wang, Yijie Deng.

**Figure 1.** Figure 1: Overview of the proposed PlanBench-V In summary, our contributions are as follows: (1) We introduce PlanBench-V, the first benchmark explicitly designed to evaluate vision-language models in the context of spatial planning map interpretation, grounded in real-world planning documents and expert knowledge. (2) We propose a four-dimensional assessment framework—Perception, Reasoning, Association, and Impleme… view at source ↗

**Figure 2.** Figure 2: : Geographical Distribution of Planning Cases in SPMD [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: : Performance of 17 Models on Planbench-V Vision-Language Tasks across [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: : Image resolution sensitivity analysis. Boxplots of per-model scores [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗

read the original abstract

Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PlanBench-V creates a domain-specific VLM benchmark for planning maps with expert data but the 27% gain and framework rest on unverified annotations.

read the letter

The main thing to know is that this paper builds PlanBench-V as the first benchmark for VLMs on spatial planning map interpretation. It includes the SPMD dataset of 223 maps and 1629 expert-annotated QA pairs plus a four-stage framework of Perception, Reasoning, Association, and Implementation.

The work fills a clear gap by moving past general visual benchmarks to tasks tied to territorial governance and policy. The experiments compare models across years and report that Qwen3.6-Plus beats GPT-4o by 27 percent while all models still fail on implementation questions that require judgment and constraint handling. The domain focus and the split into progressive capabilities give the paper a practical angle.

The soft spot is the missing validation for the annotations and the framework. The abstract gives no inter-annotator agreement numbers, no disagreement resolution details, and no check that the four categories actually match how planners read maps. That leaves the ground truth and the performance differences open to doubt until the methods section is examined.

This paper is for researchers who build or evaluate multimodal models for applied domains like urban planning. Readers who care about benchmark construction or domain adaptation will find the dataset and the reported limitations worth looking at.

It deserves peer review so the annotation process and category assignments can be checked directly.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces PlanBench-V, the first benchmark for VLMs on spatial planning map interpretation. It constructs the SPMD dataset (223 maps, 1629 expert-annotated QA pairs by professional planners across regions and styles), proposes a four-capability framework (Perception, Reasoning, Association, Implementation) aligned to the cognitive pipeline of map interpretation, and reports experiments on two generations of VLMs showing a 27% gain for Qwen3.6-Plus over GPT-4o with persistent failures on implementation tasks requiring evaluative judgment and policy sensitivity.

Significance. If the ground-truth annotations and capability partitioning hold, the benchmark fills a clear gap in domain-specific multimodal evaluation for professional planning contexts that existing general VLM benchmarks overlook. The public release of data and code is a concrete strength that would enable follow-on work on domain-adaptive reasoning.

major comments (2)

[Dataset Construction] The curation process for the 1629 QA pairs (described in the abstract and Dataset section) reports expert annotation by professional planners but provides no inter-annotator agreement statistics, disagreement resolution protocol, or post-hoc validation of the four-category assignments. This directly undermines the reliability of the headline 27% performance gap and the subset-specific claims about implementation-task failures.
[Evaluation Framework] The Evaluation Framework section asserts that the four progressive capabilities correspond to the cognitive pipeline of planning map interpretation, yet reports no validation metrics, expert review of category assignments, or sensitivity analysis showing that mislabeling would not alter the reported model rankings or limitation conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of dataset reliability and framework validation that we will address in the revision. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [Dataset Construction] The curation process for the 1629 QA pairs (described in the abstract and Dataset section) reports expert annotation by professional planners but provides no inter-annotator agreement statistics, disagreement resolution protocol, or post-hoc validation of the four-category assignments. This directly undermines the reliability of the headline 27% performance gap and the subset-specific claims about implementation-task failures.

Authors: We agree that the manuscript lacks quantitative inter-annotator agreement statistics, a detailed disagreement resolution protocol, and explicit post-hoc validation of the four-category assignments. The annotations were performed by professional planners following internal guidelines, but these elements were not reported. In the revised manuscript we will expand the Dataset section with: (i) the annotation guidelines and disagreement resolution process (consensus discussion among annotators), (ii) IAA statistics on the subset of questions annotated by multiple planners, and (iii) confirmation of category assignments via expert review. These additions will directly bolster the reliability of the reported 27% gap and the implementation-task analysis. revision: yes
Referee: [Evaluation Framework] The Evaluation Framework section asserts that the four progressive capabilities correspond to the cognitive pipeline of planning map interpretation, yet reports no validation metrics, expert review of category assignments, or sensitivity analysis showing that mislabeling would not alter the reported model rankings or limitation conclusions.

Authors: The referee is correct that the current manuscript does not include validation metrics, documented expert review of the category assignments, or sensitivity analysis for potential mislabeling. The four-capability framework was developed from cognitive models in the planning literature, but these supporting details were omitted. In the revised Evaluation Framework section we will add: (i) a description of the expert review process used to confirm category assignments, (ii) alignment/validation metrics between the assigned categories and expert judgments on a held-out sample, and (iii) a sensitivity analysis showing that moderate rates of mislabeling do not change model rankings or the core conclusion regarding persistent implementation-task failures. These changes will strengthen the framework without altering the experimental findings. revision: yes

Circularity Check

0 steps flagged

New benchmark dataset with expert annotations; no derivation reduces to inputs by construction

full rationale

The paper introduces SPMD (223 maps, 1629 QA pairs) and a four-capability framework as a new contribution. No equations, fitted parameters, self-citations, or renamings appear in the provided text. The evaluation framework is presented as theory-informed but does not define capabilities in terms of model outputs or reduce any result to prior fitted values. This is a standard independent benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumptions that the four-stage framework accurately models planning cognition and that expert annotations constitute reliable ground truth; no free parameters or invented entities are introduced.

axioms (2)

domain assumption The cognitive pipeline of planning map interpretation decomposes into Perception, Reasoning, Association, and Implementation stages.
This decomposition underpins the proposed evaluation framework.
domain assumption Expert-annotated QA pairs curated by professional planners provide valid ground truth for VLM evaluation.
The dataset construction and performance claims depend on this.

pith-pipeline@v0.9.1-grok · 5832 in / 1292 out tokens · 52226 ms · 2026-06-28T02:07:10.665669+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 21 canonical work pages · 8 internal anchors

[1]

VQA: Visual Question Answering. October. ArXiv:1505.00468 [cs], Available from: http://arxiv.org/abs/1505.00468. Albrechts, L., Healey, P., and Kunzmann, K.R.,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Alexander, C., 1977.A pattern language: Towns, buildings, construction

Strategic spatial planning and regional governance in europe.Journal of the American Planning Association, 69 (2), 113–129. Alexander, C., 1977.A pattern language: Towns, buildings, construction. Oxford University Press. Arnheim, R., 1969.Visual thinking. University of California Press. Bai, J.,et al.,

1977
[3]

Bertin, J., 1983.Semiology of graphics: Diagrams, networks, maps

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. Bertin, J., 1983.Semiology of graphics: Diagrams, networks, maps. University of Wisconsin Press. Burgess, R. and Carmona, M.I.,

1983
[4]

Available from: https://arxiv.org/abs/2211.08545

Mapqa: A dataset for question answering on choropleth maps. Available from: https://arxiv.org/abs/2211.08545. Chen, S.,et al.,

work page arXiv
[5]

Chen, Z.,et al.,

Bring reason to vision: Understanding perception and reasoning through model merging.arXiv preprint arXiv:2505.05464. Chen, Z.,et al.,

work page arXiv
[6]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238. DeepMind, G.,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

DesignQA: A Multimodal Benchmark for Evaluating Large Language Models’ Understanding of Engineering Documentation. August. D¨ uhr, S., 2007.The visual language of spatial planning: Exploring cartographic repre- sentations for spatial planning in europe. 1st ed. London: Routledge. Available from: https://doi.org/10.4324/9780203965818. D¨ uhr, S.,

work page doi:10.4324/9780203965818 2007
[8]

Seed1.5-VL Technical Report

Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062. Harley, J.B.,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Harley, J.B., 2002.The new nature of maps: Essays in the history of cartography

Deconstructing the map.Cartographica: The International Journal for Geographic Information and Geovisualization, 26 (2), 1–20. Harley, J.B., 2002.The new nature of maps: Essays in the history of cartography. Baltimore: Johns Hopkins University Press. Healey, P., 1997.Collaborative planning: Shaping places in fragmented societies. UBC Press. Huang, Y.,et al.,

2002
[10]

Hurst, A.,et al.,

Aesbench: An expert benchmark for multimodal large language models on image aesthetics perception.arXiv preprint arXiv:2401.08276. Hurst, A.,et al.,

work page arXiv
[11]

GPT-4o System Card

Gpt-4o system card.arXiv preprint arXiv:2410.21276. Koffka, K., 1935.Principles of gestalt psychology. Harcourt, Brace and Company. Kress, G. and van Leeuwen, T., 2006.Reading images: The grammar of visual design. 2nd ed. Routledge. Krishna, R.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv 1935
[12]

Shamma, et al

Available from: https://doi.org/10.1007/s11263-016-0981-7. Lai, Y.,et al.,

work page doi:10.1007/s11263-016-0981-7
[13]

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models,

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.arXiv preprint arXiv:2503.13939. Li, C.,et al.,

work page arXiv
[14]

Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.arXiv preprint arXiv:2308.03349, 2023

Scigraphqa: A large-scale synthetic multi-turn question- answering dataset for scientific graphs. Available from: https://arxiv.org/abs/2308.03349. Li, Z.,et al.,

work page arXiv
[15]

Lin, T.Y.,et al.,

Designprobe: A graphic design benchmark for multimodal large language models.arXiv preprint arXiv:2404.14801. Lin, T.Y.,et al.,

work page arXiv
[16]

Microsoft COCO: Common Objects in Context

Microsoft COCO: Common Objects in Context. February. ArXiv:1405.0312 [cs], Available from: http://arxiv.org/abs/1405.0312. Liu, H.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Visual Instruction Tuning

Visual instruction tuning. Available from: https://arxiv.org/abs/2304.08485. Lynch, K., 1984.Good city form. MIT press. Mathew, M.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv 1984
[18]

Mayer, R.E., 2005.The cambridge handbook of multimedia learning

Available from: https://arxiv.org/abs/2008.08899. Mayer, R.E., 2005.The cambridge handbook of multimedia learning. Cambridge University Press. Moroni, S. and Lorini, G.,

work page arXiv 2008
[19]

MIT Press

Palmer, S.E., 1999.Vision science: Photons to phenomenology. MIT Press. Pan, J.,et al.,

1999
[20]

Qian, T.,et al.,

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning.arXiv preprint arXiv:2502.19634. Qian, T.,et al.,

work page arXiv
[21]

Available from: https://arxiv.org/abs/2405.08807

Scifibench: Benchmarking large multimodal models for scientific figure interpretation. Available from: https://arxiv.org/abs/2405.08807. Roberts, J.,et al.,

work page arXiv
[22]

Shen, H.,et al.,

Charting New Territories: Exploring the geographic and geospatial capabilities of multimodal LLMs.arXiv preprint arXiv:2311.14656. Shen, H.,et al.,

work page arXiv
[23]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615. Sima, C.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

A framework for theory and practice in landscape planning.Process archi- tecture, 127, 12–31. Wang, P.,et al., 2024a. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. Available from: https://arxiv.org/abs/2409.12191. Wang, W.,et al., 2024b. Needle In A Multimodal Haystack. October. ArXiv:2406.07230 [cs], Available fro...

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Zhu, H., Chen, G., and Zhang, W., 2025a

Uniaa: A unified multi-modal image aesthetic assessment baseline and benchmark.arXiv preprint arXiv:2404.09619. Zhu, H., Chen, G., and Zhang, W., 2025a. PlanGPT: Enhancing urban planning with a tailored agent framework.In: G. Rehm and Y. Li, eds.Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Tra...

work page arXiv 2025

[1] [1]

VQA: Visual Question Answering. October. ArXiv:1505.00468 [cs], Available from: http://arxiv.org/abs/1505.00468. Albrechts, L., Healey, P., and Kunzmann, K.R.,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Alexander, C., 1977.A pattern language: Towns, buildings, construction

Strategic spatial planning and regional governance in europe.Journal of the American Planning Association, 69 (2), 113–129. Alexander, C., 1977.A pattern language: Towns, buildings, construction. Oxford University Press. Arnheim, R., 1969.Visual thinking. University of California Press. Bai, J.,et al.,

1977

[3] [3]

Bertin, J., 1983.Semiology of graphics: Diagrams, networks, maps

Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. Bertin, J., 1983.Semiology of graphics: Diagrams, networks, maps. University of Wisconsin Press. Burgess, R. and Carmona, M.I.,

1983

[4] [4]

Available from: https://arxiv.org/abs/2211.08545

Mapqa: A dataset for question answering on choropleth maps. Available from: https://arxiv.org/abs/2211.08545. Chen, S.,et al.,

work page arXiv

[5] [5]

Chen, Z.,et al.,

Bring reason to vision: Understanding perception and reasoning through model merging.arXiv preprint arXiv:2505.05464. Chen, Z.,et al.,

work page arXiv

[6] [6]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.arXiv preprint arXiv:2312.14238. DeepMind, G.,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

DesignQA: A Multimodal Benchmark for Evaluating Large Language Models’ Understanding of Engineering Documentation. August. D¨ uhr, S., 2007.The visual language of spatial planning: Exploring cartographic repre- sentations for spatial planning in europe. 1st ed. London: Routledge. Available from: https://doi.org/10.4324/9780203965818. D¨ uhr, S.,

work page doi:10.4324/9780203965818 2007

[8] [8]

Seed1.5-VL Technical Report

Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062. Harley, J.B.,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Harley, J.B., 2002.The new nature of maps: Essays in the history of cartography

Deconstructing the map.Cartographica: The International Journal for Geographic Information and Geovisualization, 26 (2), 1–20. Harley, J.B., 2002.The new nature of maps: Essays in the history of cartography. Baltimore: Johns Hopkins University Press. Healey, P., 1997.Collaborative planning: Shaping places in fragmented societies. UBC Press. Huang, Y.,et al.,

2002

[10] [10]

Hurst, A.,et al.,

Aesbench: An expert benchmark for multimodal large language models on image aesthetics perception.arXiv preprint arXiv:2401.08276. Hurst, A.,et al.,

work page arXiv

[11] [11]

GPT-4o System Card

Gpt-4o system card.arXiv preprint arXiv:2410.21276. Koffka, K., 1935.Principles of gestalt psychology. Harcourt, Brace and Company. Kress, G. and van Leeuwen, T., 2006.Reading images: The grammar of visual design. 2nd ed. Routledge. Krishna, R.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv 1935

[12] [12]

Shamma, et al

Available from: https://doi.org/10.1007/s11263-016-0981-7. Lai, Y.,et al.,

work page doi:10.1007/s11263-016-0981-7

[13] [13]

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models,

Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models.arXiv preprint arXiv:2503.13939. Li, C.,et al.,

work page arXiv

[14] [14]

Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs.arXiv preprint arXiv:2308.03349, 2023

Scigraphqa: A large-scale synthetic multi-turn question- answering dataset for scientific graphs. Available from: https://arxiv.org/abs/2308.03349. Li, Z.,et al.,

work page arXiv

[15] [15]

Lin, T.Y.,et al.,

Designprobe: A graphic design benchmark for multimodal large language models.arXiv preprint arXiv:2404.14801. Lin, T.Y.,et al.,

work page arXiv

[16] [16]

Microsoft COCO: Common Objects in Context

Microsoft COCO: Common Objects in Context. February. ArXiv:1405.0312 [cs], Available from: http://arxiv.org/abs/1405.0312. Liu, H.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

Visual Instruction Tuning

Visual instruction tuning. Available from: https://arxiv.org/abs/2304.08485. Lynch, K., 1984.Good city form. MIT press. Mathew, M.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv 1984

[18] [18]

Mayer, R.E., 2005.The cambridge handbook of multimedia learning

Available from: https://arxiv.org/abs/2008.08899. Mayer, R.E., 2005.The cambridge handbook of multimedia learning. Cambridge University Press. Moroni, S. and Lorini, G.,

work page arXiv 2008

[19] [19]

MIT Press

Palmer, S.E., 1999.Vision science: Photons to phenomenology. MIT Press. Pan, J.,et al.,

1999

[20] [20]

Qian, T.,et al.,

Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning.arXiv preprint arXiv:2502.19634. Qian, T.,et al.,

work page arXiv

[21] [21]

Available from: https://arxiv.org/abs/2405.08807

Scifibench: Benchmarking large multimodal models for scientific figure interpretation. Available from: https://arxiv.org/abs/2405.08807. Roberts, J.,et al.,

work page arXiv

[22] [22]

Shen, H.,et al.,

Charting New Territories: Exploring the geographic and geospatial capabilities of multimodal LLMs.arXiv preprint arXiv:2311.14656. Shen, H.,et al.,

work page arXiv

[23] [23]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615. Sima, C.,et al.,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

A framework for theory and practice in landscape planning.Process archi- tecture, 127, 12–31. Wang, P.,et al., 2024a. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. Available from: https://arxiv.org/abs/2409.12191. Wang, W.,et al., 2024b. Needle In A Multimodal Haystack. October. ArXiv:2406.07230 [cs], Available fro...

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Zhu, H., Chen, G., and Zhang, W., 2025a

Uniaa: A unified multi-modal image aesthetic assessment baseline and benchmark.arXiv preprint arXiv:2404.09619. Zhu, H., Chen, G., and Zhang, W., 2025a. PlanGPT: Enhancing urban planning with a tailored agent framework.In: G. Rehm and Y. Li, eds.Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Tra...

work page arXiv 2025