AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

Haotian Li; Jianwei Hu; Jinshan Lai; Keyang Wang; Leyuan Wang; Liuyu Xiang; Qiang Ma; Yida Wang; Zhaofeng He; Zonghao Guo

arxiv: 2606.28049 · v1 · pith:PEHDWG3Onew · submitted 2026-06-26 · 💻 cs.CV

AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration

Haotian Li , Yida Wang , Leyuan Wang , Jinshan Lai , Keyang Wang , Zonghao Guo , Qiang Ma , Liuyu Xiang

show 2 more authors

Jianwei Hu Zhaofeng He

This is my paper

Pith reviewed 2026-06-29 04:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords AirGroundBenchmultimodal large language modelsspatial intelligenceUAV-UGV collaborationcross-view alignmentvision-language navigationembodied decision-making

0 comments

The pith

MLLMs perform well on single-view perception but consistently fail at cross-view alignment and transformation reasoning in air-ground settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds AirGroundBench to diagnose how multimodal large language models handle spatial understanding when drone and ground-vehicle views must be combined. It supplies synchronized observation pairs from 11 simulated environments and organizes questions into four levels that grow harder: basic perception, matching objects across views, reasoning through scale and rotation changes, and using that information for navigation actions. Tests on 13 models show the expected pattern that perception is relatively easy while alignment and transformation steps create sharp drops, and those drops carry forward into closed-loop decision tasks. Dual-view input improves results over single views yet leaves a clear distance from human performance. The work therefore treats geometric consistency across heterogeneous viewpoints as the central unsolved constraint for embodied MLLMs.

Core claim

Evaluations reveal that current MLLMs handle spatial perception adequately but encounter consistent difficulties with cross-view alignment and transformation-intensive reasoning; these difficulties carry over into sequential vision-language navigation, and while dual-view inputs yield measurable gains they do not close the gap to human performance.

What carries the argument

AirGroundBench, a benchmark built from 11 high-fidelity environments, 1,021 synchronized air-ground pairs, and approximately 62,000 four-option VQA instances plus 115 navigation episodes, annotated with cross-view object identities and metric 2D/3D boxes, and partitioned into ten task types across four capability dimensions.

If this is right

Deficits in cross-view alignment directly degrade performance on embodied decision-making sequences.
Dual-view inputs produce consistent but partial gains over single-view baselines.
Geometric consistency across reference frames remains the dominant bottleneck for these models.
Progress on alignment and transformation reasoning is required before reliable multi-agent navigation can be achieved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment failures are likely to appear in any multi-scale sensor fusion setting, such as satellite-plus-street-level imagery.
Adding explicit geometric modules or training objectives that penalize inconsistent cross-view predictions could be tested as a direct remedy.
Extending the benchmark to include dynamic occlusion patterns or changing lighting would stress-test whether the identified bottlenecks are robust.

Load-bearing premise

The simulated environments and observation pairs faithfully reproduce the scale mismatch, asymmetric occlusion, and reference-frame problems that arise in actual heterogeneous air-ground collaboration.

What would settle it

Demonstration that any of the tested models reaches human-level accuracy on the same task suite when the inputs are replaced by real-world synchronized UAV-UGV video streams would falsify the claim of a persistent geometric-consistency limitation.

Figures

Figures reproduced from arXiv: 2606.28049 by Haotian Li, Jianwei Hu, Jinshan Lai, Keyang Wang, Leyuan Wang, Liuyu Xiang, Qiang Ma, Yida Wang, Zhaofeng He, Zonghao Guo.

**Figure 2.** Figure 2: Task taxonomy of AirGroundBench. We organize 10 task types into four capability dimensions: spatial perception, cross-view alignment, spatial transformation and reasoning, and embodied decision-making. Each task is formulated to probe geometric consistency and cross-view spatial understanding under heterogeneous UAV– UGV observations. (P-Obj) probes recognition of object-level properties under viewpoint/s… view at source ↗

**Figure 3.** Figure 3: AirGroundBench dataset overview. (a) VQA question distribution across capability groups (Perception/Alignment/Reasoning) with per-task counts. (b) Distribution of shortest-path lengths for VLN episodes (meters). (c) Environment composition (City vs. Wild) and per-environment image-pair counts. This design directly controls the UAV–UGV baseline and viewpoint disparity, and makes cross-view scale calibrat… view at source ↗

read the original abstract

In recent years, multimodal large language models (MLLMs) have shown strong potential for embodied intelligence, yet their ability to maintain geometrically consistent spatial understanding across heterogeneous views remains under-evaluated. Existing benchmarks largely focus on single-agent, single-view perception, leaving a gap in the systematic assessment of collaborative air-ground settings, where multi-scale observations are complementary but introduce scale mismatch, asymmetric occlusion, and reference-frame inconsistencies. We present AirGroundBench, a diagnostic benchmark for evaluating multi-view spatial intelligence in heterogeneous UAV-UGV collaboration. AirGroundBench is built from 11 high-fidelity simulated environments with 1,021 synchronized air-ground observation pairs, yielding approximately 62,000 dual-view, four-option single-choice visual question answering instances and 115 closed-loop vision-language navigation episodes. It covers 10 task types organized into four progressively demanding capability dimensions: spatial perception, cross-view alignment, spatial transformation and reasoning, and embodied decision-making. To support geometry-grounded evaluation and analysis, we provide structured spatial annotations, including cross-view object identities and metric 2D and 3D bounding boxes. Evaluations of 13 representative MLLMs under UAV-only, UGV-only, and dual-view input settings reveal consistent bottlenecks: models perform relatively well on spatial perception but struggle with cross-view alignment and transformation-intensive reasoning, and these deficits propagate to sequential decision-making in vision-language navigation. Although dual-view inputs provide measurable gains over single-view variants, a persistent gap from human performance remains, highlighting geometric consistency as a key limitation of current embodied MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AirGroundBench gives a concrete testbed for cross-view spatial deficits in embodied MLLMs, but the all-simulated setup leaves open whether the reported bottlenecks reflect real deployment conditions.

read the letter

The main takeaway is that this paper builds AirGroundBench to probe how MLLMs handle spatial tasks when UAV and UGV views must be combined. It reports that models manage basic perception but consistently struggle with alignment across views and the transformations needed for reasoning, with those gaps showing up in navigation episodes as well.

The construction is the strongest part. They collected 1,021 synchronized pairs from 11 simulated environments, added structured annotations for object identities and metric boxes, and organized 10 task types into four increasing levels of demand. Running the same 13 models on UAV-only, UGV-only, and dual-view inputs produces a clean comparison that shows measurable gains from the second view without closing the human gap. That pattern is useful for anyone trying to diagnose where current architectures fall short on multi-agent geometry.

The soft spot is the exclusive reliance on simulation. The stress-test concern is fair: there are no real-robot captures, no statistics comparing simulated versus physical observation distributions, and no discussion of how sensor noise or irregular occlusions in the field might change the difficulty. With only 115 closed-loop navigation episodes, the claim that perception and reasoning deficits propagate to decision-making also rests on limited data. If the simulated reference-frame shifts turn out milder than actual air-ground deployments, the bottlenecks could be overstated.

This is for groups working on embodied MLLMs or heterogeneous robot teams who need a diagnostic beyond single-view benchmarks. A reader focused on benchmark design or spatial reasoning evaluation would get practical value from the task breakdown and baseline numbers.

The work shows honest engagement with the literature gap and produces reproducible-looking task definitions. It deserves peer review so referees can check the question-generation and annotation procedures and push for more on sim-to-real transfer.

Referee Report

2 major / 2 minor

Summary. The paper introduces AirGroundBench, a diagnostic benchmark for multi-view spatial intelligence in heterogeneous UAV-UGV collaboration. Constructed from 11 high-fidelity simulated environments yielding 1,021 synchronized observation pairs, it generates ~62,000 dual-view VQA instances and 115 closed-loop navigation episodes across 10 task types in four capability dimensions (spatial perception, cross-view alignment, spatial transformation/reasoning, embodied decision-making). Structured annotations include cross-view object identities and metric bounding boxes. Evaluations of 13 MLLMs under single- and dual-view settings show relative strength in perception but consistent weaknesses in alignment and transformation reasoning that propagate to navigation; dual-view inputs yield gains yet a persistent gap to human performance remains.

Significance. If the simulated tasks and annotations faithfully instantiate the geometric difficulties of scale mismatch, asymmetric occlusion, and reference-frame inconsistencies, the work would provide a valuable diagnostic tool for embodied MLLM research. The progressive task dimensions and geometry-grounded annotations are positive features that enable targeted analysis beyond aggregate accuracy.

major comments (2)

[§3 and §5] §3 (Benchmark Construction) and §5 (Experiments): The headline claim that observed bottlenecks reflect fundamental limitations in cross-view alignment and transformation reasoning rests on the unverified assumption that the 11 simulated environments and 1,021 pairs reproduce real-world scale mismatch, asymmetric occlusion, and reference-frame inconsistencies. No quantitative comparison of simulated vs. real observation statistics (e.g., occlusion distributions, scale ratios, or camera calibration errors) is reported, and no physical-robot validation is included. This is load-bearing for interpreting the results as general rather than simulator-specific.
[§4.2 and Table 2] §4.2 (Task Design) and Table 2: The four capability dimensions are presented as progressively demanding, yet the paper provides no ablation or correlation analysis showing that failure on transformation-intensive tasks causally drives the navigation deficits (as opposed to other factors such as instruction following or long-horizon planning). Without such evidence the propagation claim remains correlational.

minor comments (2)

[Abstract and §5] The abstract and §1 state that dual-view inputs provide 'measurable gains' but do not report effect sizes or statistical significance for the improvement over single-view baselines in the main results tables.
[Figure 3] Figure 3 (example observation pairs) would benefit from explicit scale bars or metric annotations to illustrate the claimed scale mismatch between air and ground views.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments on our manuscript. We address each of the major comments below, providing clarifications and indicating planned revisions where appropriate.

read point-by-point responses

Referee: [§3 and §5] §3 (Benchmark Construction) and §5 (Experiments): The headline claim that observed bottlenecks reflect fundamental limitations in cross-view alignment and transformation reasoning rests on the unverified assumption that the 11 simulated environments and 1,021 pairs reproduce real-world scale mismatch, asymmetric occlusion, and reference-frame inconsistencies. No quantitative comparison of simulated vs. real observation statistics (e.g., occlusion distributions, scale ratios, or camera calibration errors) is reported, and no physical-robot validation is included. This is load-bearing for interpreting the results as general rather than simulator-specific.

Authors: We agree that the benchmark is constructed in simulation and that we do not include direct quantitative comparisons to real-world data or physical robot experiments. Our environments are selected from high-fidelity simulators to capture the key geometric challenges mentioned, including variations in scale, occlusion, and viewpoints. However, we recognize that without explicit sim-to-real validation, the results should be interpreted within the context of simulated settings. In the revised version, we will expand the discussion in §3 and §5 to explicitly state the assumptions and limitations regarding generalization to real-world scenarios, and we will include qualitative examples comparing simulated observations to typical real UAV/UGV imagery where possible. revision: yes
Referee: [§4.2 and Table 2] §4.2 (Task Design) and Table 2: The four capability dimensions are presented as progressively demanding, yet the paper provides no ablation or correlation analysis showing that failure on transformation-intensive tasks causally drives the navigation deficits (as opposed to other factors such as instruction following or long-horizon planning). Without such evidence the propagation claim remains correlational.

Authors: The task dimensions are designed with logical progression in mind, where spatial transformation and reasoning build upon alignment and perception, and the navigation tasks require integrating these capabilities. The results in Table 2 show that models with lower performance on transformation tasks also exhibit larger gaps in navigation. While we did not perform explicit causal ablations, the per-dimension breakdowns support the propagation narrative. To address this, we will add a correlation analysis between dimension-specific accuracies and navigation success rates in the revised manuscript to provide more quantitative support for the claim. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivations or self-referential reductions

full rationale

The paper introduces AirGroundBench as a diagnostic benchmark consisting of simulated environments, observation pairs, VQA instances, and navigation episodes. It reports empirical performance of 13 MLLMs across task types and input settings. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text or abstract. The central claims rest on direct evaluation results rather than any chain that reduces to its own construction by definition. This is a standard empirical study; the absence of mathematical derivation chains makes circularity analysis inapplicable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical validity of the simulated environments and task design; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5849 in / 1107 out tokens · 38012 ms · 2026-06-29T04:31:45.613338+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 29 canonical work pages · 8 internal anchors

[1]

AI, M.: Kimi large language model.https://kimi.moonshot.cn(2024)

2024
[2]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

AI, Z.: Glm-4: Open multilingual multimodal large model. arXiv preprint arXiv:2406.12793 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3674–3683 (2018).https://doi.org/10....

work page doi:10.1109/cvpr.2018.00387 2018
[4]

In: CVPR (2018)

Anderson, P., et al.: Vision-and-language navigation: Interpreting visually- grounded navigation instructions in real environments. In: CVPR (2018)

2018
[5]

Anthropic: Claude 3 model card.https://www.anthropic.com/news/claude-3- family(2024)

2024
[6]

In: ICCV (2015)

Antol, S., et al.: Vqa: Visual question answering. In: ICCV (2015)

2015
[7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., et al.: Qwen-vl: A versatile vision-language model for understanding, lo- calization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

Basappa, A., Goel, P., Karra, A., Karra, A., Gilmore, A., Zhu, K.: Amvicc: A novel benchmark for cross-modal failure mode profiling for vlms and igms (jan 2026),http://arxiv.org/abs/2601.17037v1, published: 2026-01- 20T00:06:58Z; Updated: 2026-01-20T00:06:58Z; Categories: cs.CV; cs.AI; PDF: https://arxiv.org/pdf/2601.17037v1.pdf

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

ByteDance: Doubao large model.https://www.volcengine.com/product/doubao (2024)

2024
[10]

AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

Cai, H., Rao, Y., Huang, L., Zhong, Z., Dong, J., Tan, J., Lu, W., Zhong, R.: Airnav: A large-scale real-world uav vision-and-language navigation dataset with natural and diverse instructions (jan 2026),http://arxiv.org/abs/2601. 03707v1, published: 2026-01-07T08:46:09Z; Updated: 2026-01-07T08:46:09Z; Cat- egories: cs.CL; PDF: https://arxiv.org/pdf/2601.0...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Chen, S., Guhur, P.L., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 16516–16525 (2022).https://doi.org/10.1109/CVPR52688. 2022 . 01604,https : / / openaccess . thecvf . com ...

work page doi:10.1109/cvpr52688 2022
[12]

MM-UA VBench: How well do multimodal large language models see, think, and plan in low-altitude uav scenarios?

Dai, S., Ma, Z., Luo, Z., Yang, X., Huang, Y., Zhang, W., Chen, C., Guo, Z., Xu, W., Sun, Y., Sun, M.: Mm-uavbench: How well do multimodal large language mod- els see, think, and plan in low-altitude uav scenarios? (dec 2025),http://arxiv. org / abs / 2512 . 23219v1, published: 2025-12-29T05:49:54Z; Updated: 2025-12- 29T05:49:54Z; Categories: cs.CV; PDF: ...

work page arXiv 2025
[13]

Gao, C., Zhao, B., Zhang, W., Mao, J., Zhang, J., Zheng, Z., Man, F., Fang, J., Zhou, Z., Cui, J., Chen, X., Li, Y.: EmbodiedCity: A Benchmark Plat- form for Embodied Agent in Real-world City Environment (Oct 2024).https: //doi.org/10.48550/arXiv.2410.09604,http://arxiv.org/abs/2410.09604, arXiv:2410.09604 [cs] titleTranslation: TLDR: A benchmark platform...

work page doi:10.48550/arxiv.2410.09604 2024
[14]

Intelligence32(2), 175–191 (2004).https://doi.org/10

Hegarty, M., Waller, D.: A dissociation between mental rotation and perspective- taking spatial abilities. Intelligence32(2), 175–191 (2004).https://doi.org/10. 1016/j.intell.2003.12.001

2004
[15]

In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2022),https://papers.neurips

Hu, Y., Fang, S., Lei, Z., Zhong, Y., Chen, S.: Where2comm: Communication- efficient collaborative perception via spatial confidence maps. In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2022),https://papers.neurips. cc / paper _ files / paper / 2022 / file / 1f5c5cd01b864d53cc5fa0a3472e152e - Paper-Conference.pdf

2022
[16]

In: CVPR (2019)

Hudson, D., Manning, C.: Gqa: A new dataset for real-world visual reasoning. In: CVPR (2019)

2019
[17]

Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: Om- niSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Lan- guage Models (Sep 2025).https://doi.org/10.48550/arXiv.2506.03135,http: //arxiv.org/abs/2506.03135,arXiv:2506.03135[cs]titleTranslation:TLDR:Om- niSpatial is introduced, a comprehensive and challenging b...

work page doi:10.48550/arxiv.2506.03135 2025
[18]

In: ECCV (2020)

Krantz, J., et al.: Beyond the nav-graph: Vision-and-language navigation in con- tinuous environments. In: ECCV (2020)

2020
[19]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 4392–4412 (2020).https://doi.org/10.18653/v1/ 2020.emnlp-main.356,https://aclanthology.org/2020....

work page doi:10.18653/v1/ 2020
[20]

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Lee, K., Lee, I., Kwak, M., Ryu, K., Hong, J., Park, J.: Spatialmosaic: A multi- view vlm dataset for partial visibility (dec 2025),http://arxiv.org/abs/2512. 23365v1, published: 2025-12-29T10:48:54Z; Updated: 2025-12-29T10:48:54Z; Cat- egories: cs.CV; PDF: https://arxiv.org/pdf/2512.23365v1.pdf

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Li, D., Li, H., Wang, Z., Yan, Y., Zhang, H., Chen, S., Hou, G., Jiang, S., Zhang, W., Shen, Y., Lu, W., Zhuang, Y.: ViewSpatial-Bench: Evaluating Multi- perspective Spatial Localization in Vision-Language Models (Sep 2025).https: //doi.org/10.48550/arXiv.2505.21500,http://arxiv.org/abs/2505.21500, arXiv:2505.21500 [cs] titleTranslation: - TLDR: This work...

work page doi:10.48550/arxiv.2505.21500 2025
[22]

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?arXiv:2503.23765, 2025

Li, Y., Zhang, Y., Lin, T., Liu, X., Cai, W., Liu, Z., Zhao, B.: STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? (Jul 2025). https://doi.org/10.48550/arXiv.2503.23765,http://arxiv.org/abs/2503. 23765, arXiv:2503.23765 [cs] titleTranslation: Sti-benchMllms TLDR: STI-Bench is introduced, a benchmark designed to evaluate MLLMs’s ...

work page doi:10.48550/arxiv.2503.23765 2025
[23]

org / abs / 2512

Liu, X., Liu, Y., Qiu, H., Qirong, Y., Lian, Z.: Indooruav: Bench- marking vision-language uav navigation in continuous indoor environments (dec 2025),http : / / arxiv . org / abs / 2512 . 19024v1, published: 2025-12- 22T04:42:35Z; Updated: 2025-12-22T04:42:35Z; Categories: cs.RO; cs.AI; PDF: https://arxiv.org/pdf/2512.19024v1.pdf AirGroundBench 17

work page arXiv 2025
[24]

Clarendon Press, Oxford University Press (1978)

O’Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford University Press (1978)

1978
[25]

OpenAI: Gpt-4v(ision) system card.https://openai.com/research/gpt- 4v- system-card(2023)

2023
[26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hen- gel, A.: Reverie: Remote embodied visual referring expression in real indoor en- vironments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9989–9998 (2020).https://doi.org/10. 1109/CVPR42600.2020.01000,https://openaccess.thecvf.co...

work page arXiv 2020
[27]

Science 171(3972), 701–703 (1971).https://doi.org/10.1126/science.171.3972.701

Shepard, R.N., Metzler, J.: Mental rotation of three-dimensional objects. Science 171(3972), 701–703 (1971).https://doi.org/10.1126/science.171.3972.701

work page doi:10.1126/science.171.3972.701 1971
[28]

Sohn, T.S., Dillitzer, M., Corso, J.J., Sax, E.: Embodied4C: Measuring What Mat- ters for Embodied Vision-Language Navigation (Dec 2025).https://doi.org/10. 48550/arXiv.2512.18028,http://arxiv.org/abs/2512.18028, arXiv:2512.18028 [cs] titleTranslation: 4C- TLDR: Comprehensive evaluation across ten state-of-the- art VLMs and four embodied control baselines...

work page arXiv 2025
[29]

arXiv preprint arXiv:2312.14838 (2023)

Sun, Y., et al.: Ernie 4.0: Knowledge enhanced foundation model. arXiv preprint arXiv:2312.14838 (2023)

work page arXiv 2023
[30]

Gemini: A Family of Highly Capable Multimodal Models

Team, G.D.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Team, Q.: Qwen2-vl: Enhancing vision-language models with multimodal reason- ing. arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Tencent: Hunyuan large model.https://hunyuan.tencent.com(2024)

2024
[33]

Cognitive maps in rats and men.Psychological Review, 55(4):189–208, 1948

Tolman, E.C.: Cognitive maps in rats and men. Psychological Review55(4), 189– 208 (1948).https://doi.org/10.1037/h0061626

work page doi:10.1037/h0061626 1948
[34]

In: IEEE/CVF International Conference on Computer Vision

Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: Gridmm: Grid memory map for vision-and-language navigation. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV). pp. 5579–5588 (2023).https: //doi.org/10.1109/ICCV51070.2023.01432,https://openaccess.thecvf.com/ content/ICCV2023/papers/Wang_GridMM_Grid_Memory_Map_for_Vision- a...

work page doi:10.1109/iccv51070.2023.01432 2023
[35]

Xu, H., Hu, Y., Zhu, Z., Gao, C., Wang, Z., Rao, J., Lu, W., Li, W., Yin, Q., Li, Y.: Citycube: Benchmarking cross-view spatial reasoning on vision-language models in urban environments (jan 2026),http://arxiv.org/abs/2601.14339v1, published: 2026-01-20T13:44:02Z; Updated: 2026-01-20T13:44:02Z; Categories: cs.CV; cs.AI; PDF: https://arxiv.org/pdf/2601.14339v1.pdf

work page arXiv 2026
[36]

Schnabel, and Tingying Peng

Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M.H., Ma, J.: V2x-vit: Vehicle-to- everything cooperative perception with vision transformer. In: European Confer- enceonComputerVision(ECCV)(2022).https://doi.org/10.1007/978-3-031- 19842- 7_7,https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/ 136990106.pdf

work page doi:10.1007/978-3-031- 2022
[37]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., Lin, D., Wang, T., Pang, J.: MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence (Sep 2025).https://doi.org/10.48550/arXiv.2505.23764, http://arxiv.org/abs/2505.23764,arXiv:2505.23764[cs]titleTranslation:Mmsi- bench TLDR: An automated error analysis pipeli...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.23764 2025
[38]

Zha, J., Fan, Y., Zhang, T., Chen, G., Chen, Y., Gao, C., Chen, X.: AirCop- Bench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning (Nov 2025).https://doi.org/10.48550/arXiv.2511.11025,http: //arxiv.org/abs/2511.11025, arXiv:2511.11025 [cs] titleTranslation: Aircop- bench TLDR: AirCopBench is introduced, the first comprehensive ...

work page doi:10.48550/arxiv.2511.11025 2025
[39]

Zhao, X., Zhou, G., Wu, Q.: Vln-mme: Diagnosing mllms as language-guided visual navigation agents (dec 2025),http://arxiv.org/abs/2512.24851v2, published: 2025-12-31T13:21:21Z; Updated: 2026-01-06T11:00:10Z; Categories: cs.CV; cs.RO; PDF: https://arxiv.org/pdf/2512.24851v2.pdf

work page arXiv 2025
[40]

Nocedal, S

Zhou, G., Hong, Y., Wang, Z., Wang, X.E., Wu, Q.: Navgpt-2: Unleashing navi- gational reasoning capability for large vision-language models. In: European Con- ference on Computer Vision (ECCV) (2024).https://doi.org/10.1007/978- 3- 031- 72667- 5_15,https://www.ecva.net/papers/eccv_2024/papers_ECCV/ papers/01143.pdf

work page doi:10.1007/978- 2024
[41]

Zhou, Y., Quang, L., Nieto-Granda, C., Loianno, G.: CoPeD-Advancing Multi- Robot Collaborative Perception: A Comprehensive Dataset in Real-World Envi- ronments (May 2024).https://doi.org/10.48550/arXiv.2405.14731,http: //arxiv.org/abs/2405.14731, arXiv:2405.14731 [cs] titleTranslation:

work page doi:10.48550/arxiv.2405.14731 2024

[1] [1]

AI, M.: Kimi large language model.https://kimi.moonshot.cn(2024)

2024

[2] [2]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

AI, Z.: Glm-4: Open multilingual multimodal large model. arXiv preprint arXiv:2406.12793 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3674–3683 (2018).https://doi.org/10....

work page doi:10.1109/cvpr.2018.00387 2018

[4] [4]

In: CVPR (2018)

Anderson, P., et al.: Vision-and-language navigation: Interpreting visually- grounded navigation instructions in real environments. In: CVPR (2018)

2018

[5] [5]

Anthropic: Claude 3 model card.https://www.anthropic.com/news/claude-3- family(2024)

2024

[6] [6]

In: ICCV (2015)

Antol, S., et al.: Vqa: Visual question answering. In: ICCV (2015)

2015

[7] [7]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., et al.: Qwen-vl: A versatile vision-language model for understanding, lo- calization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

Basappa, A., Goel, P., Karra, A., Karra, A., Gilmore, A., Zhu, K.: Amvicc: A novel benchmark for cross-modal failure mode profiling for vlms and igms (jan 2026),http://arxiv.org/abs/2601.17037v1, published: 2026-01- 20T00:06:58Z; Updated: 2026-01-20T00:06:58Z; Categories: cs.CV; cs.AI; PDF: https://arxiv.org/pdf/2601.17037v1.pdf

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

ByteDance: Doubao large model.https://www.volcengine.com/product/doubao (2024)

2024

[10] [10]

AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

Cai, H., Rao, Y., Huang, L., Zhong, Z., Dong, J., Tan, J., Lu, W., Zhong, R.: Airnav: A large-scale real-world uav vision-and-language navigation dataset with natural and diverse instructions (jan 2026),http://arxiv.org/abs/2601. 03707v1, published: 2026-01-07T08:46:09Z; Updated: 2026-01-07T08:46:09Z; Cat- egories: cs.CL; PDF: https://arxiv.org/pdf/2601.0...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR)

Chen, S., Guhur, P.L., Tapaswi, M., Schmid, C., Laptev, I.: Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion (CVPR). pp. 16516–16525 (2022).https://doi.org/10.1109/CVPR52688. 2022 . 01604,https : / / openaccess . thecvf . com ...

work page doi:10.1109/cvpr52688 2022

[12] [12]

MM-UA VBench: How well do multimodal large language models see, think, and plan in low-altitude uav scenarios?

Dai, S., Ma, Z., Luo, Z., Yang, X., Huang, Y., Zhang, W., Chen, C., Guo, Z., Xu, W., Sun, Y., Sun, M.: Mm-uavbench: How well do multimodal large language mod- els see, think, and plan in low-altitude uav scenarios? (dec 2025),http://arxiv. org / abs / 2512 . 23219v1, published: 2025-12-29T05:49:54Z; Updated: 2025-12- 29T05:49:54Z; Categories: cs.CV; PDF: ...

work page arXiv 2025

[13] [13]

Gao, C., Zhao, B., Zhang, W., Mao, J., Zhang, J., Zheng, Z., Man, F., Fang, J., Zhou, Z., Cui, J., Chen, X., Li, Y.: EmbodiedCity: A Benchmark Plat- form for Embodied Agent in Real-world City Environment (Oct 2024).https: //doi.org/10.48550/arXiv.2410.09604,http://arxiv.org/abs/2410.09604, arXiv:2410.09604 [cs] titleTranslation: TLDR: A benchmark platform...

work page doi:10.48550/arxiv.2410.09604 2024

[14] [14]

Intelligence32(2), 175–191 (2004).https://doi.org/10

Hegarty, M., Waller, D.: A dissociation between mental rotation and perspective- taking spatial abilities. Intelligence32(2), 175–191 (2004).https://doi.org/10. 1016/j.intell.2003.12.001

2004

[15] [15]

In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2022),https://papers.neurips

Hu, Y., Fang, S., Lei, Z., Zhong, Y., Chen, S.: Where2comm: Communication- efficient collaborative perception via spatial confidence maps. In: Advances in Neu- ral Information Processing Systems (NeurIPS) (2022),https://papers.neurips. cc / paper _ files / paper / 2022 / file / 1f5c5cd01b864d53cc5fa0a3472e152e - Paper-Conference.pdf

2022

[16] [16]

In: CVPR (2019)

Hudson, D., Manning, C.: Gqa: A new dataset for real-world visual reasoning. In: CVPR (2019)

2019

[17] [17]

Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: Om- niSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Lan- guage Models (Sep 2025).https://doi.org/10.48550/arXiv.2506.03135,http: //arxiv.org/abs/2506.03135,arXiv:2506.03135[cs]titleTranslation:TLDR:Om- niSpatial is introduced, a comprehensive and challenging b...

work page doi:10.48550/arxiv.2506.03135 2025

[18] [18]

In: ECCV (2020)

Krantz, J., et al.: Beyond the nav-graph: Vision-and-language navigation in con- tinuous environments. In: ECCV (2020)

2020

[19] [19]

NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails

Ku, A., Anderson, P., Patel, R., Ie, E., Baldridge, J.: Room-across-room: Multi- lingual vision-and-language navigation with dense spatiotemporal grounding. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 4392–4412 (2020).https://doi.org/10.18653/v1/ 2020.emnlp-main.356,https://aclanthology.org/2020....

work page doi:10.18653/v1/ 2020

[20] [20]

SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

Lee, K., Lee, I., Kwak, M., Ryu, K., Hong, J., Park, J.: Spatialmosaic: A multi- view vlm dataset for partial visibility (dec 2025),http://arxiv.org/abs/2512. 23365v1, published: 2025-12-29T10:48:54Z; Updated: 2025-12-29T10:48:54Z; Cat- egories: cs.CV; PDF: https://arxiv.org/pdf/2512.23365v1.pdf

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Li, D., Li, H., Wang, Z., Yan, Y., Zhang, H., Chen, S., Hou, G., Jiang, S., Zhang, W., Shen, Y., Lu, W., Zhuang, Y.: ViewSpatial-Bench: Evaluating Multi- perspective Spatial Localization in Vision-Language Models (Sep 2025).https: //doi.org/10.48550/arXiv.2505.21500,http://arxiv.org/abs/2505.21500, arXiv:2505.21500 [cs] titleTranslation: - TLDR: This work...

work page doi:10.48550/arxiv.2505.21500 2025

[22] [22]

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?arXiv:2503.23765, 2025

Li, Y., Zhang, Y., Lin, T., Liu, X., Cai, W., Liu, Z., Zhao, B.: STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? (Jul 2025). https://doi.org/10.48550/arXiv.2503.23765,http://arxiv.org/abs/2503. 23765, arXiv:2503.23765 [cs] titleTranslation: Sti-benchMllms TLDR: STI-Bench is introduced, a benchmark designed to evaluate MLLMs’s ...

work page doi:10.48550/arxiv.2503.23765 2025

[23] [23]

org / abs / 2512

Liu, X., Liu, Y., Qiu, H., Qirong, Y., Lian, Z.: Indooruav: Bench- marking vision-language uav navigation in continuous indoor environments (dec 2025),http : / / arxiv . org / abs / 2512 . 19024v1, published: 2025-12- 22T04:42:35Z; Updated: 2025-12-22T04:42:35Z; Categories: cs.RO; cs.AI; PDF: https://arxiv.org/pdf/2512.19024v1.pdf AirGroundBench 17

work page arXiv 2025

[24] [24]

Clarendon Press, Oxford University Press (1978)

O’Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Clarendon Press, Oxford University Press (1978)

1978

[25] [25]

OpenAI: Gpt-4v(ision) system card.https://openai.com/research/gpt- 4v- system-card(2023)

2023

[26] [26]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hen- gel, A.: Reverie: Remote embodied visual referring expression in real indoor en- vironments. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9989–9998 (2020).https://doi.org/10. 1109/CVPR42600.2020.01000,https://openaccess.thecvf.co...

work page arXiv 2020

[27] [27]

Science 171(3972), 701–703 (1971).https://doi.org/10.1126/science.171.3972.701

Shepard, R.N., Metzler, J.: Mental rotation of three-dimensional objects. Science 171(3972), 701–703 (1971).https://doi.org/10.1126/science.171.3972.701

work page doi:10.1126/science.171.3972.701 1971

[28] [28]

Sohn, T.S., Dillitzer, M., Corso, J.J., Sax, E.: Embodied4C: Measuring What Mat- ters for Embodied Vision-Language Navigation (Dec 2025).https://doi.org/10. 48550/arXiv.2512.18028,http://arxiv.org/abs/2512.18028, arXiv:2512.18028 [cs] titleTranslation: 4C- TLDR: Comprehensive evaluation across ten state-of-the- art VLMs and four embodied control baselines...

work page arXiv 2025

[29] [29]

arXiv preprint arXiv:2312.14838 (2023)

Sun, Y., et al.: Ernie 4.0: Knowledge enhanced foundation model. arXiv preprint arXiv:2312.14838 (2023)

work page arXiv 2023

[30] [30]

Gemini: A Family of Highly Capable Multimodal Models

Team, G.D.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Team, Q.: Qwen2-vl: Enhancing vision-language models with multimodal reason- ing. arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Tencent: Hunyuan large model.https://hunyuan.tencent.com(2024)

2024

[33] [33]

Cognitive maps in rats and men.Psychological Review, 55(4):189–208, 1948

Tolman, E.C.: Cognitive maps in rats and men. Psychological Review55(4), 189– 208 (1948).https://doi.org/10.1037/h0061626

work page doi:10.1037/h0061626 1948

[34] [34]

In: IEEE/CVF International Conference on Computer Vision

Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: Gridmm: Grid memory map for vision-and-language navigation. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision (ICCV). pp. 5579–5588 (2023).https: //doi.org/10.1109/ICCV51070.2023.01432,https://openaccess.thecvf.com/ content/ICCV2023/papers/Wang_GridMM_Grid_Memory_Map_for_Vision- a...

work page doi:10.1109/iccv51070.2023.01432 2023

[35] [35]

Xu, H., Hu, Y., Zhu, Z., Gao, C., Wang, Z., Rao, J., Lu, W., Li, W., Yin, Q., Li, Y.: Citycube: Benchmarking cross-view spatial reasoning on vision-language models in urban environments (jan 2026),http://arxiv.org/abs/2601.14339v1, published: 2026-01-20T13:44:02Z; Updated: 2026-01-20T13:44:02Z; Categories: cs.CV; cs.AI; PDF: https://arxiv.org/pdf/2601.14339v1.pdf

work page arXiv 2026

[36] [36]

Schnabel, and Tingying Peng

Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M.H., Ma, J.: V2x-vit: Vehicle-to- everything cooperative perception with vision transformer. In: European Confer- enceonComputerVision(ECCV)(2022).https://doi.org/10.1007/978-3-031- 19842- 7_7,https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/ 136990106.pdf

work page doi:10.1007/978-3-031- 2022

[37] [37]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., Lin, D., Wang, T., Pang, J.: MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence (Sep 2025).https://doi.org/10.48550/arXiv.2505.23764, http://arxiv.org/abs/2505.23764,arXiv:2505.23764[cs]titleTranslation:Mmsi- bench TLDR: An automated error analysis pipeli...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.23764 2025

[38] [38]

Zha, J., Fan, Y., Zhang, T., Chen, G., Chen, Y., Gao, C., Chen, X.: AirCop- Bench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning (Nov 2025).https://doi.org/10.48550/arXiv.2511.11025,http: //arxiv.org/abs/2511.11025, arXiv:2511.11025 [cs] titleTranslation: Aircop- bench TLDR: AirCopBench is introduced, the first comprehensive ...

work page doi:10.48550/arxiv.2511.11025 2025

[39] [39]

Zhao, X., Zhou, G., Wu, Q.: Vln-mme: Diagnosing mllms as language-guided visual navigation agents (dec 2025),http://arxiv.org/abs/2512.24851v2, published: 2025-12-31T13:21:21Z; Updated: 2026-01-06T11:00:10Z; Categories: cs.CV; cs.RO; PDF: https://arxiv.org/pdf/2512.24851v2.pdf

work page arXiv 2025

[40] [40]

Nocedal, S

Zhou, G., Hong, Y., Wang, Z., Wang, X.E., Wu, Q.: Navgpt-2: Unleashing navi- gational reasoning capability for large vision-language models. In: European Con- ference on Computer Vision (ECCV) (2024).https://doi.org/10.1007/978- 3- 031- 72667- 5_15,https://www.ecva.net/papers/eccv_2024/papers_ECCV/ papers/01143.pdf

work page doi:10.1007/978- 2024

[41] [41]

Zhou, Y., Quang, L., Nieto-Granda, C., Loianno, G.: CoPeD-Advancing Multi- Robot Collaborative Perception: A Comprehensive Dataset in Real-World Envi- ronments (May 2024).https://doi.org/10.48550/arXiv.2405.14731,http: //arxiv.org/abs/2405.14731, arXiv:2405.14731 [cs] titleTranslation:

work page doi:10.48550/arxiv.2405.14731 2024