arxiv: 2602.03916 · v3 · submitted 2026-02-03 · cs.CV · cs.CE · cs.CL · cs.LG

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

Pith reviewed 2026-05-16 07:50 UTC · model grok-4.3

keywords spatial reasoning, vision-language models, benchmark, visual question answering, real-world evaluation, depth perception, 3D geometry, orientation

The pith

Vision-language models reach only 55 percent accuracy on real-world spatial reasoning tasks, far below the 88 percent human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatiaLab, a benchmark of 1,400 real-world visual questions designed to test spatial reasoning in unconstrained scenes rather than synthetic puzzles. It evaluates models across six categories: relative positioning, depth and occlusion, orientation, size and scale, spatial navigation, and 3D geometry. The strongest model in the multiple-choice format, InternVL3.5-72B, reaches 54.93 percent accuracy, compared with 87.57 percent for humans, and scores drop further in open-ended settings. This gap indicates that current models struggle with the visual noise, complex relationships, and diversity found in everyday images.

Core claim

Current vision-language models exhibit a substantial gap in spatial reasoning capabilities relative to humans when tested on realistic, unconstrained images, as measured by the SpatiaLab benchmark of 1,400 questions spanning six major categories and 30 subcategories, where the strongest model scores 54.93 percent in multiple-choice evaluation versus 87.57 percent for humans.

What carries the argument

SpatiaLab, a benchmark dataset of 1,400 visual question-answer pairs with balanced coverage across six main categories and five subcategories each, supporting both multiple-choice and open-ended formats.
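
The structure described here can be pictured with a minimal Python sketch. It assumes a hypothetical record layout: the field names `category`, `subcategory`, `options`, `answer_index`, and `reference_answer` are illustrative and are not taken from the released files. The checker encodes the minimums the abstract states (at least 25 questions per subcategory, at least 200 per category) together with five subcategories under each of the six categories.

```python
from collections import Counter
from dataclasses import dataclass

# The six category names as listed in the abstract.
CATEGORIES = (
    "Relative Positioning",
    "Depth & Occlusion",
    "Orientation",
    "Size & Scale",
    "Spatial Navigation",
    "3D Geometry",
)


@dataclass(frozen=True)
class Item:
    """Hypothetical record layout; the released files may use other fields."""
    category: str
    subcategory: str
    question: str
    options: tuple          # multiple-choice options
    answer_index: int       # index of the correct option
    reference_answer: str   # free-text answer for open-ended scoring


def coverage_problems(items, min_per_sub=25, min_per_cat=200, subs_per_cat=5):
    """Return human-readable violations of the stated coverage minimums."""
    per_cat = Counter(item.category for item in items)
    per_sub = Counter((item.category, item.subcategory) for item in items)
    problems = []
    for cat in sorted(set(per_cat) - set(CATEGORIES)):
        problems.append(f"unknown category: {cat}")
    for cat in CATEGORIES:
        if per_cat[cat] < min_per_cat:
            problems.append(
                f"{cat}: {per_cat[cat]} questions, expected at least {min_per_cat}"
            )
        n_subs = sum(1 for (c, _sub) in per_sub if c == cat)
        if n_subs != subs_per_cat:
            problems.append(f"{cat}: {n_subs} subcategories, expected {subs_per_cat}")
    for (cat, sub), n in sorted(per_sub.items()):
        if n < min_per_sub:
            problems.append(
                f"{cat} / {sub}: {n} questions, expected at least {min_per_sub}"
            )
    return problems
```

An empty list from `coverage_problems` means a loaded file meets the stated minimums. It says nothing about question quality or scene diversity, which is the separate concern raised under "Load-bearing premise".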

If this is right

VLMs show particular weaknesses in depth perception, occlusion handling, navigation, and 3D geometry tasks compared with simpler positioning questions.
All tested models show a performance drop of roughly 10 to 25 percent when moving from multiple-choice to open-ended question formats.
The benchmark exposes limitations that synthetic or LLM-generated environments previously masked.
Specialized spatial reasoning models do not outperform general-purpose VLMs on this real-world test set.
The dataset provides a standardized framework for measuring progress toward human-aligned spatial understanding in future models.
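
The headline numbers behind these points can be checked with a little arithmetic. The sketch below uses only the four accuracies quoted in the abstract; the dictionary keys and function names are illustrative. Because the best model differs by format (InternVL3.5-72B for multiple-choice, GPT-5-mini for open-ended), it does not compute a per-model drop across formats, which would need the paper's full results table.

```python
# Headline accuracies (percent) as quoted in the abstract.
HEADLINE = {
    "multiple_choice": {"best_model": 54.93, "human": 87.57},  # InternVL3.5-72B
    "open_ended": {"best_model": 40.93, "human": 64.93},       # GPT-5-mini
}


def gap_points(fmt):
    """Human minus best-model accuracy, in percentage points."""
    row = HEADLINE[fmt]
    return round(row["human"] - row["best_model"], 2)


def fraction_of_human(fmt):
    """Best-model accuracy as a fraction of the human baseline."""
    row = HEADLINE[fmt]
    return row["best_model"] / row["human"]


def human_format_drop():
    """Drop in the human baseline from multiple-choice to open-ended, in points."""
    mcq = HEADLINE["multiple_choice"]["human"]
    open_ended = HEADLINE["open_ended"]["human"]
    return round(mcq - open_ended, 2)


for fmt in HEADLINE:
    print(
        f"{fmt}: gap {gap_points(fmt):.2f} points, "
        f"best model at {fraction_of_human(fmt):.1%} of human accuracy"
    )
print(f"human drop across formats: {human_format_drop():.2f} points")
```

The human baseline itself falls by more than 20 points in the open-ended setting, so the format change affects both sides of the comparison, not only the models.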

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The benchmark could serve as a diagnostic tool to identify whether failures stem from visual encoding, language integration, or reasoning steps.
Integrating explicit geometric priors or 3D-aware training objectives might narrow the observed gap on SpatiaLab tasks.
Similar evaluation approaches could be applied to related domains such as temporal or causal reasoning in visual scenes.
Robotics and augmented reality systems relying on current VLMs may inherit these spatial limitations in dynamic environments.

Load-bearing premise

The 1,400 questions accurately capture unconstrained real-world spatial reasoning without selection bias, annotation errors, or over-representation of certain scene types.

What would settle it

If a new or fine-tuned VLM scores above 80 percent on the multiple-choice portion of SpatiaLab while maintaining similar performance on open-ended questions, that would indicate the reported gap is not inherent to current model architectures.

Figures

Figures reproduced from arXiv: 2602.03916 by Abdur Rahman, Azmine Toushik Wasi, Mahfuz Ahmed Anik, Majd Hawasly, Md. Iqramul Hoque, Md Rizwan Parvez, Mohammad Raza, Mohammed Eunus Ali, Mohsin Mahmud Topu, Munem Shahriar, Rahatun Nesa Priti, Sabrina Afroz Mitu, Sadia Tasnim Meem, Shahriyar Zaman Ridoy, Wahid Faisal.

**Figure 1.** Overview of SPATIALAB. The benchmark addresses limitations of prior datasets (left), introduces 1,400 visual QA pairs spanning 6 categories and 30 subcategories (center), and enables systematic evaluation through multiple-choice and open-ended tasks. It features diverse task and image complexity, with varied object counts, layers, lighting, textures, relations, and materials (right). Existing spatial bench…

**Figure 2.** Representative examples from six categories in open-ended and MCQ Tasks.

**Figure 3.** Data creation pipeline for SPATIALAB. Images are collected via web crawling, targeted search, and manual snapshots, followed by structured annotation of spatial question–answer pairs.

**Figure 4.** SFT training results: (a) learning trends over epochs, and (b) final accuracy values.

**Figure 5.** Categories and subcategories of spatial reasoning in SPATIALAB. Each category decomposes into five subcategories, yielding thirty task types in total.

**Figure 6.** Prompts used for evaluation.

**Figure 7.** Chain-of-thought (CoT) Prompts.

**Figure 8.** Chain-of-thought (CoT) with Self-Reflection Prompts.

**Figure 9.** SFT Dynamics Analysis.

**Figure 10.** Prompts for SPATIOXOLVER (Part 1).

**Figure 11.** Prompts for SPATIOXOLVER (Part 2).

**Figure 12.** Prompts for SPATIOXOLVER (Part 3).

**Figure 13.** Some examples from the benchmark (3D Geometry: Gravity Effects) and evaluation…

**Figure 14.** Some examples from the benchmark (3D Geometry: Shape Projection) and evaluation…

**Figure 15.** Some examples from the benchmark (3D Geometry: Spatial Containment) and evaluation…

**Figure 16.** Some examples from the benchmark (3D Geometry: Volume Comparison) and evaluation…

**Figure 17.** Some examples from the benchmark (3D Geometry: Shape Projection) and evaluation…

**Figure 18.** Some examples from the benchmark (Spatial Navigation: Accessibility Constraints) and…

**Figure 19.** Some examples from the benchmark (Spatial Navigation: Obstacle Avoidance) and…

**Figure 20.** Some examples from the benchmark (Spatial Navigation: Pathway Existence) and…

**Figure 21.** Some examples from the benchmark (Spatial Navigation: Viewpoint Visibility) and…

**Figure 22.** Some examples from the benchmark (Spatial Navigation: Spatial Sequence) and evaluation…

**Figure 23.** Some examples from the benchmark (Size and Scale: Scale Consistency) and evaluation…

**Figure 24.** Some examples from the benchmark (Size and Scale: Shadow-Size Projection) and…

**Figure 25.** Some examples from the benchmark (Size and Scale: Perspective Distortion) and…

**Figure 26.** Some examples from the benchmark (Size and Scale: Relative Size Comparison) and…

**Figure 27.** Some examples from the benchmark (Size and Scale: Relative Size Comparison) and…

**Figure 28.** Some examples from the benchmark (Depth and Occlusion: Complete Occlusion Infer…

**Figure 29.** Some examples from the benchmark (Depth and Occlusion: Layering Order) and evalua…

**Figure 30.** Some examples from the benchmark (Depth and Occlusion: Partial Occlusion) and…

**Figure 31.** Some examples from the benchmark (Depth and Occlusion: Reflective Surfaces) and…

**Figure 32.** Some examples from the benchmark (Depth and Occlusion: Transparency Effects) and…

**Figure 33.** Some examples from the benchmark (Orientation: Cardinal Direction) and evaluation…

**Figure 34.** Some examples from the benchmark (Orientation: Facing Direction) and evaluation…

**Figure 35.** Some examples from the benchmark (Orientation: Object Rotation) and evaluation results.

**Figure 36.** Some examples from the benchmark (Orientation: Stacking Orientation) and evaluation…

**Figure 37.** Some examples from the benchmark (Orientation: Tool Handedness) and evaluation…

**Figure 38.** Some examples from the benchmark (Relative Positioning: Alignment Patterns) and…

**Figure 39.** Some examples from the benchmark (Relative Positioning: Betweenness Relationships)…

**Figure 40.** Some examples from the benchmark (Relative Positioning: Corner/Angle Positioning)…

**Figure 41.** Some examples from the benchmark (Relative Positioning: Directional Relations) and…

**Figure 42.** Some examples from the benchmark (Relative Positioning: Proximity Gradients) and…

Original abstract

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpatiaLab adds a real-image benchmark for VLM spatial reasoning with 1,400 questions across 30 tasks, but the reported human-model gap depends on unverified question quality.

The paper's core move is introducing SpatiaLab, a benchmark of 1,400 real-world image questions covering six categories like relative positioning, depth, orientation, and 3D geometry, each with five sub-tasks. It reports clear gaps, such as InternVL3.5-72B at 54.93% multiple-choice accuracy against 87.57% for humans, with further drops in open-ended settings. This shifts away from the synthetic or puzzle-style tests that dominated earlier work, which is a practical step forward for evaluating models on noisy scenes that matter for navigation or robotics applications. The scale and split into multiple-choice plus open-ended formats give a straightforward way to compare models and track progress. The results are presented consistently across open and closed models, which helps show the pattern holds beyond any single system.

The main weakness is the missing detail on how the questions and images were collected and checked. The abstract states the task counts and headline numbers but gives no information on image sources, question writing process, verification steps, or controls for annotation errors and scene biases. Without that, the size of the reported gap is hard to interpret as a pure measure of spatial reasoning limits rather than an artifact of test construction.

This paper is aimed at researchers building or evaluating VLMs who need a more realistic test set than existing options. A reader focused on spatial capabilities would find the task breakdown and baseline numbers useful as a starting point, even if they plan to re-validate parts of the data.

It deserves peer review. The benchmark idea is timely and the empirical scope is reasonable, but referees need to examine the data pipeline closely before the performance claims can be treated as settled.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatiaLab, a benchmark of 1,400 real-world visual question-answer pairs spanning six spatial-reasoning categories (Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, 3D Geometry) with 30 sub-tasks. It evaluates multiple state-of-the-art VLMs on both multiple-choice and open-ended formats, reporting that the strongest model (InternVL3.5-72B) reaches 54.93% accuracy versus 87.57% for humans in the multiple-choice setting and that all models drop 10-25% in open-ended evaluation, thereby documenting a substantial VLM-human gap in unconstrained spatial reasoning.

Significance. If the benchmark construction and human baselines are sound, SpatiaLab supplies a needed real-image alternative to synthetic or puzzle-style spatial tests, directly quantifying current VLM limitations in depth, navigation, and 3D geometry on naturalistic scenes. The dual evaluation modes and explicit human comparison strengthen its utility as a diagnostic tool for guiding future VLM work.

major comments (2)

[Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).
[Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.

minor comments (2)

[Dataset Statistics] The repository link is given but the paper should include a concise table or appendix summarizing the exact distribution of questions across the 30 sub-tasks and any filtering steps applied to the source images.
[Experiments] Results tables would benefit from error bars or bootstrap confidence intervals on the reported accuracies to allow readers to judge the stability of the observed gaps.
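
The bootstrap intervals the referee asks for could be computed along these lines. This is a minimal percentile-bootstrap sketch using only the Python standard library, not the authors' procedure. It assumes the headline 54.93% comes from a single pass over all 1,400 questions, which corresponds to 769 correct answers (769/1400 ≈ 54.93%); the per-question vector is reconstructed from that count because the page does not reproduce item-level results.

```python
import random


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    correct: sequence of 0/1 outcomes, one per question.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Assumption: 769 of 1,400 correct, reconstructed from the reported 54.93%.
outcomes = [1] * 769 + [0] * (1400 - 769)
low, high = bootstrap_ci(outcomes)
print(f"accuracy {sum(outcomes) / len(outcomes):.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

With 1,400 questions the sampling error on an overall accuracy is a few percentage points at most, so the headline gap is unlikely to be a sampling artifact. Intervals matter far more per subcategory, where the abstract guarantees only 25 questions.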

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve transparency and reproducibility.

Referee: [Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).

Authors: We agree that expanded details on benchmark construction are required to substantiate the reported gap. In the revised manuscript we will substantially enlarge the Benchmark Construction section with: image provenance (curated selection from public real-world datasets including MS-COCO, OpenImages, and Places365, with explicit diversity criteria across indoor/outdoor and urban/rural scenes); question-authoring protocol (10 trained annotators using a standardized template and guidelines for each of the 30 sub-tasks); correctness verification (independent review by two additional annotators per question, with disagreements resolved via discussion or exclusion); inter-annotator agreement (Cohen’s kappa of 0.81 on a 10% overlap sample); and bias/noise controls (stratified sampling by scene type, pilot studies to prune ambiguous items, and final label-noise audit yielding <4% error rate). These additions will enable full reproducibility and direct assessment of benchmark fidelity.

Revision: yes

Referee: [Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.

Authors: We acknowledge that the human baseline procedure must be described in full. In the revision we will insert a dedicated “Human Baseline Collection” subsection specifying: number of annotators (18 participants recruited through a qualified crowdsourcing pool); qualification criteria (screening test requiring ≥80% accuracy on 20 sample spatial-reasoning items); agreement statistics (average pairwise agreement 84%, Fleiss’ kappa 0.77); and handling of ambiguous cases (items with <65% agreement were reviewed by two experts, rephrased when possible, or excluded, resulting in 7% of the initial pool being refined or removed). These details will make the human–VLM comparison transparent and reproducible.

Revision: yes
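
For readers unfamiliar with the agreement statistics named in the simulated responses, here is a minimal sketch of Fleiss' kappa in plain Python. It follows the standard textbook definition; the toy ratings are invented for illustration and have no connection to SpatiaLab's annotators. The input is a count matrix in which `counts[i][j]` is the number of raters who assigned item `i` to answer option `j`, with the same number of raters for every item.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of per-item category counts.

    counts[i][j]: number of raters who put item i in category j.
    Every row must sum to the same number of raters (at least 2).
    Returns (kappa, mean_pairwise_agreement).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of rater pairs that agree, averaged over items.
    pairs = n_raters * (n_raters - 1)
    per_item = [(sum(c * c for c in row) - n_raters) / pairs for row in counts]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    n_categories = len(counts[0])
    shares = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    p_chance = sum(s * s for s in shares)

    return (p_observed - p_chance) / (1 - p_chance), p_observed


# Toy data (invented): 4 items, 3 raters each, 2 answer options.
toy = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa, agreement = fleiss_kappa(toy)
print(f"mean pairwise agreement {agreement:.2f}, Fleiss' kappa {kappa:.2f}")
```

The second return value is the "average pairwise agreement" kind of figure the response quotes alongside kappa. Kappa discounts that figure by the agreement expected from the answer distribution alone, which is why the two numbers differ.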

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

The paper introduces SpatiaLab as a fixed set of 1,400 real-image questions across six categories and 30 sub-tasks, then directly measures VLM accuracy against human baselines on those same questions. No equations, fitted parameters, self-referential predictions, or derivations appear; performance numbers (e.g., InternVL3.5-72B at 54.93% vs. humans at 87.57%) are reported outcomes of straightforward evaluation rather than quantities forced by construction from the inputs. The work is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the curated real-world questions and the reliability of the human performance baseline; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5723 in / 1030 out tokens · 33444 ms · 2026-05-16T07:50:45.806922+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
cs.CV · 2026-05 · unverdicted · novelty 7.0

SceneFunRI benchmark shows current VLMs struggle severely with inferring locations of invisible functional objects, with the strongest model (Gemini 3 Flash) reaching only 15.20 CAcc@75.

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
cs.CV · 2026-04 · unverdicted · novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

Qwen2.5-VL Technical Report

URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf. Accessed: 2025-09-11. Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19129–19139...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/bf02310555 2025
[2]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

URLhttps://arxiv.org/abs/2507.01006. Yaacov Trope and Nira Liberman. Construal-level theory of psychological distance.Psychological review, 117(2):440, 2010. Raphael Vallat. Pingouin: statistics in python.Journal of Open Source Software, 3(31):1026, November 2018. doi: 10.21105/joss.01026. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.21105/joss.01026 2010
[3]

0” indicates no model answered correctly, while “5

{opt4} SpatiaLab-MCQ : Direct : Prompt for Answering Model You are a spatial reasoning assistant. Answer the following question in 2-4 sentences about the given image. Question: {question} SpatiaLab-Open : Direct : Prompt for Answering Model You are a strict evaluator. You will receive a question, the ground truth answer, and a model's answer. Return only...

work page arXiv 2026
[4]

Across models, the mean MCQ score is systematically higher than the open-ended score, with the average gap at 23.0% and σ= 5.5%

MCQ structural advantage.MCQ format constrains the output space, allowing models to exploit surface cues or eliminate distractors. Across models, the mean MCQ score is systematically higher than the open-ended score, with the average gap at 23.0% and σ= 5.5% . Negative gaps observed for Llama-3.2-11B in depth & occlusion (−1.98) and o4-mini in relative po...

work page
[5]

Specialization bias in spatial reasoning models.Spatial reasoning models exhibit the largest average gaps (27.03%) despite being optimized for spatial tasks, suggesting that they may have been trained primarily on categorical or synthetic selection tasks rather than free- form generation. For instance, InternVL3.5-72B, a 72B parameter spatial reasoning mo...

work page
[6]

Instruction-tuning and stepwise decoding.Reasoning-oriented models (e.g., Gemini-2- Flash-Thinking, o4-mini) demonstrate smaller average gaps (around 19.11%) and lower variance across subtasks. For example, Gemini-2-Flash-Thinking has an overall gap of 13.0% and consistently modest subtask gaps (Difference in 3D Geometry: 13.9%, Difference Relative Positi...

work page
[7]

Spatio-navigation stresses sequential grounding.Spatial navigation gaps are the strongest predictor of overall MCQ→open-ended disparity (Pearson r= 0.99 ). Models with high performance gaps in spatial navigation, such as SpaceQwen2.5-VL-3B-Instruct (Spatial Nav- igation gap = 35.87%), also exhibit large overall gaps (29.78%), indicating that multi-step re...

work page 2026
[8]

U-shaped

{opt4} Final answer format: <result>1</result> SpatiaLab-MCQ : CoT : Prompt for Answering Model You are a spatial reasoning assistant. Given a multiple choice question about an image, think step by step and the following question in 2-4 sentences. Question: {question} SpatiaLab-Open : CoT : Prompt for Answering Model Figure 7: Chain-of-thought (CoT) Promp...

work page 2026
[9]

Object Identification: - Shapes (circle, square, triangle, polygon, etc.) - Natural elements (tree, sun, water, etc.) - Artificial objects (arrows, patterns, icons)

work page
[10]

Attribute Noticing: - Size (small, medium, large, relative comparison) - Orientation (upright, rotated, tilted) - Shading or texture (solid, striped, shaded, transparent) - Color (if available)

work page
[11]

Positioning: - Relative location (leftmost, rightmost, top, bottom, center) - Proximity to other objects

work page
[12]

Convert a verbose image description into a structured list of distinct objects

Level of Detail: - Do not summarize - Be verbose and explicit about every visible detail Base VLM Agent Prompt You are an Object Segmentation Specialist. Convert a verbose image description into a structured list of distinct objects. Input description: {visual_description} Your focus:

work page
[13]

Object Isolation: - Treat each unique instance as a separate object - If multiple objects are similar, assign separate IDs

work page
[14]

- Keep descriptions short but complete

Labeling: - Assign IDs: Obj1, Obj2, Obj3... - Keep descriptions short but complete

work page
[15]

For each object, extract its properties into structured JSON

Example Output: Obj1: Small shaded circle (leftmost) Obj2: Large unshaded square (center) Obj3: Triangle rotated 45° (rightmost) Object Segmentation Agent Prompt You are an Attribute Extraction Specialist. For each object, extract its properties into structured JSON. Your focus:

work page
[16]

Attributes to Capture: - Shape - Size (relative if necessary) - Color or shading - Orientation or rotation - Position (leftmost, rightmost, center, top, bottom, etc.)

work page
[17]

ID": "Obj1

Output Format: JSON array, e.g.: [ {{"ID": "Obj1", "Shape": "Circle", "Size": "Small", "Shading": "Yes", "Rotation": "0°", "Position": "Leftmost"}}, {{"ID": "Obj2", "Shape": "Square", "Size": "Large", "Shading": "No", "Rotation": "45°", "Position": "Center"}} ] Attribute Extraction Agent Prompt Figure 10: Prompts for SPATIOXOLVER(Part 1). 55 Published as ...

work page 2026
[18]

Core Relations: - left_of, right_of, above, below - inside, surrounding - touching, overlapping - aligned, centered

work page
[19]

Precision: - Always specify directionality (ObjA left_of ObjB is not the same as ObjB left_of ObjA) - Use consistent object IDs

work page
[20]

each circle is above a square

Structural Patterns: - Detect repeated relations (e.g., "each circle is above a square") - Note spacing and alignment

work page
[21]

Identify higher-order structures from objects and relations

Output Format: Triples in form: (ObjectA, Relation, ObjectB) Example: (Obj1, left_of, Obj2) (Obj3, inside, Obj4) Spatial Relation Agent Prompt You are a Grouping and Symmetry Specialist. Identify higher-order structures from objects and relations. Relations: {relation_list} Your focus:

work page
[22]

Grouping: - Clusters of objects (rows, columns, triangles, grids) - Repeated patterns

work page
[23]

Symmetry: - Vertical, horizontal, rotational - Mirror or reflective symmetry - Translational repetition


Output Format: JSON with two fields: "groups" and "symmetries"
Example:
{{ "groups": ["Obj1, Obj2, Obj3 form a horizontal row"], "symmetries": ["Vertical symmetry across Obj2"] }}

Grouping Symmetry Agent Prompt

You are a Transformation Tracker. Analyze changes across multiple frames of an image sequence. Your focus:


Types of Transformations: - Movement (left, right, up, down) - Scaling (larger, smaller) - Rotation (degrees) - Shading or color change - Shape change (square → circle) - Appearance/disappearance


Output Format: Per-object transformation logs:
Obj1: Frame1 → Frame2 (rotated 90° clockwise)
Obj2: Frame2 → Frame3 (moved right)

Transformation Tracking Prompt

Figure 11: Prompts for SPATIOXOLVER (Part 2).

Published as a conference paper at ICLR 2026

You are a Symbolic Abstraction Specialist. Consolidate all perception results into a standardized struct...


Create a unified JSON representation containing: - Objects with attributes - Relations as triples - Groups and symmetries - Transformations across frames


Ensure consistency in IDs and formatting


Example Output:
{{ "objects": [...], "relations": [...], "groups": [...], "symmetries": [...], "transformations": [...]}}

Representation Standardization Agent Prompt

You are a Spatial Reasoning Expert. Analyze the image data and select the correct option for the multiple-choice question.
Image Analysis Data:
- Objects and Attributes: {attributes_json}
- ...
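The consolidation step described by the Representation Standardization prompt can be sketched as follows. This is an illustrative reading rather than the paper's implementation: the function names, the regular expression, and the choice to reject relations that mention unknown IDs are assumptions. Only the five top-level keys and the triple format are taken from the prompts.

```python
import json
import re

# Matches triples written as "(Obj1, left_of, Obj2)".
TRIPLE_RE = re.compile(r"\(\s*(\w+)\s*,\s*(\w+)\s*,\s*(\w+)\s*\)")


def parse_triples(text):
    """Extract [subject, relation, object] lists from the relation agent's text."""
    return [list(match) for match in TRIPLE_RE.findall(text)]


def build_representation(attributes_json, relations_text, grouping_json,
                         transformations=None):
    """Merge the per-agent outputs into one dict with the five keys
    named in the example output."""
    grouping = json.loads(grouping_json)
    representation = {
        "objects": json.loads(attributes_json),
        "relations": parse_triples(relations_text),
        "groups": grouping.get("groups", []),
        "symmetries": grouping.get("symmetries", []),
        "transformations": transformations or [],
    }
    # Enforce the "consistency in IDs" requirement: every ID used in a
    # relation must belong to a known object.
    known_ids = {obj["ID"] for obj in representation["objects"]}
    for subj, _, obj in representation["relations"]:
        unknown = {subj, obj} - known_ids
        if unknown:
            raise ValueError(f"relation uses unknown IDs: {sorted(unknown)}")
    return representation
```

The resulting dict can be serialized with `json.dumps` before being placed into a downstream prompt slot such as `{attributes_json}`.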

From the back
MCQ Answers:
• GPT-5-mini: 4
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: A gust blowing from the back of the patio (the top of the image, toward the viewer) would push the liquid forward over the near edge and make it spill first. Winds from the sides would tend to slosh it sideways rather than directly over the front rim. ...

A rectangle narrowing at the bottom From the right side
A circle with fuzzy edges
A triangle pointing downward
MCQ Answers:
• GPT-5-mini: 4
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: You would see the pendant’s dome-shaped profile: a rounded semicircle or oval shadow with a thin vertical line or narrow neck above where the cord hangs. The shadow would fall on the wall to the right of the window, roughly over the nightstand. ...

Yes, easily with room to spare
No, the small jug's handle will get stuck
No, the small jug is too tall
Yes, but it would be a perfect fit with no room left
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: No - the tall pot's mouth appears too narrow compared with the small jug's widest part and its handle. The jug wouldn't pass through the opening to sit completely inside. ...

They both have the same volume A vertical cylinder
The microwave has slightly more volume
The oven below the stove has much more volume
The oven has more width, but the microwave has more height, so they are equal
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 3
• Kimi-VL-A3B-Thinking-2506: 3
Open-ended Answers:
• GPT-5-mini: The microwave’s cooking cavity is much smaller than the larger oven below the cooktop - it’s only a small fraction of the oven’s internal volu... The countertop microwave is sized for single plates or small dishes, whereas the built-in oven can hold full baking trays and multiple racks for larger meals. ...

An irregular shape
MCQ Answers:
• GPT-5-mini: 4
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 4
Open-ended Answers:
• GPT-5-mini: Viewed from above, the seating is arranged in a U-shape (a horseshoe) around the central fire pit. The L-shaped sectional plus the two facing chairs create that U configuration.
• Gemini-2.5-Flash: The...

Mug with pens and pencil in it
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: The leftmost corner notebook is blocked by a wooden chair. The chair sits directly in front of the desk, preventing easy access to the notebook.
• Gemini-2.5-Flash: The notebook at the left...

Unable to determine
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Yes. The central aisle is unobstructed from the foreground to the far end, with shelving stacked along both sides and no items blocking the floor. You can walk straight down the hallway without collid...

Bottom and right
MCQ Answers:
• GPT-5-mini: 2
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: The marked white crosswalks at each corner of the intersection (the painted lines across the road) are the safest to use. They are signalized and where pedestrians are waiting, unlike cutting across mid-block or diagonally through the intersection. ...

Watching on the big screen on the left side
Unable to determine
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing) gives him an unobstructed line of sight over the crowd into the stadium. ...

Go straight to the first red colored door
To make a turn and go to the second floor
Go through the door at the end of the train
No viable option
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing) gives him an unobstructed line of sight over the crowd into the stadium. ...

The size of the wires is too big
The wires need to be moved to another pole
Wires can be set there following a layout
No, the bulk of the wires is too much here for one pole
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 3
Open-ended Answers:
• GPT-5-mini: Not safely in its current state - the pole is clearly overloaded and the wires are tangled, so simply re-bundling them on the same pole would exceed mechanica... A proper redesign is possible but requires removing redundant lines, separating circuits (high-voltage vs ...

It will cover almost all of the wall
The shadow will only take the size of the light frame as its actual size
There won't be any shadow
The shadow will be smaller than the actual light body
MCQ Answers:
• GPT-5-mini: 4
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: If one lamp is rotated 90° so the two beams are perpendicular, you'll get two elongated oval patches of light on the wall at right angles to each other. Where the ovals overlap it will be brightest; outside them the wall will be darker, producing an L-shaped darker region and two perpendicular shadow lobes cast by the fixture. ...

Similar
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Yes - when like-shaped pipes are grouped you can pack them in their most efficient pattern, which reduces wasted voids and the total footprint. The exact savings depend on orientation and dimensions; in some case...

Bottom mid spring
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 2
Open-ended Answers:
• GPT-5-mini: The ash-colored thicker spring in the center will be shorter than the thin, taller silver spring just behind it (top-center) and the long, dark-gray spring at the right edge when all are stood upright. Those two coils extend above the ash-colored spring. ...

Small, large
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: 3
Open-ended Answers:
• GPT-5-mini: If you move the red bow to the back it will appear smaller and more obscured by the bows in front because of perspective and overlap, and its red color will look less vivid from that angle. Physically nothing about the bow changes - its size, shape and clasp strength remain the same as the other similar clips in front, so its holding capacity is unchanged. ...

Not applicable
MCQ Answers:
• GPT-5-mini: 2
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: 2
Open-ended Answers:
• GPT-5-mini: Sunlight streaming through window blinds (slatted shades) is casting the diagonal striped shadow on the wall. The slats break the light into parallel bands, producing the striped pattern.
• Gemini-2.5-Flas...

Woman
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground.
• Gemini-2.5-Flash: The frontmost ob...

Not applicable
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: The ball directly behind the white, square-shaped device is blue (a cyan/teal colored ball). It sits slightly to the left and behind the device.
• Gemini-2.5-Flash: The ball directly behind the white-color...

Lamp chimney
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground.
• Gemini-2.5-Flash: The front...

4
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 4
Open-ended Answers:
• GPT-5-mini: Three. The glowing bulb being held clearly shows a filament, and two more filament bulbs are visible inside the round box on the left.
• Gemini-2.5-Flash: There are three objects with filaments inside them. One is the ...

West
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 3
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: You would be traveling east. The Route 6 shield is labeled "EAST" and the straight-ahead arrow beneath it shows that going straight follows eastbound Route 6.
• Gemini-2.5-Flash: If you go straight following route 6...

Toward
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 4
Open-ended Answers:
• GPT-5-mini: If the portrait is on the left wall and faces directly into the room, it is oriented perpendicular to that wall and pointing inward. In other words, it faces away from the wall toward the center of the room.
• Gem...

120
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: About 180°. The two covers are opened in opposite directions and lie nearly flat, forming an almost straight line.
• Gemini-2.5-Flash: The cases of the pocket watch are open, revealing the intricate mechanism inside....

Behind the building and plants
In front of the plants and leads towards the building
The road is beside the building, with no clear direction
The road is stacked above the building and plants
MCQ Answers:
• GPT-5-mini: 2
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: 2
Open-ended Answers:
• GPT-5-mini: The road runs in front of the large white building, curving around a central planted island and seating area. It separates and encircles the entrance and decorative planters, forming a driveway/roundabout in the courtyard. ...

Stencil
MCQ Answers:
• GPT-5-mini: 2
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: 2
Open-ended Answers:
• GPT-5-mini: Those flowing, tapered curves are most characteristic of a paintbrush (or broad paint marker) rather than a spray can. The handedness of the brush user - most likely right-handed - would be most evident, since ri...

Berries
MCQ Answers:
• GPT-5-mini: 4
• Gemini-2.5-Flash: 4
• InternVL3.5-72B: 4
• Kimi-VL-A3B-Thinking-2506: 3
Open-ended Answers:
• GPT-5-mini: The red bowl of berries (raspberries and blueberries) is closest to the image's centerline. It sits almost directly in the middle of the composition, slightly left of the exact center.
• Gemini-2.5-Flash: The re...

A monument of various statues
Humans
MCQ Answers:
• GPT-5-mini: 1
• Gemini-2.5-Flash: 1
• InternVL3.5-72B: 1
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Between the building and the street lamp stands a stone statue of a person. It appears to be one of the sculpted figures that are part of the fountain in the foreground.
• Gemini-2.5-Flash: The object located betw...

Nothing
MCQ Answers:
• GPT-5-mini: 2
• Gemini-2.5-Flash: 2
• InternVL3.5-72B: 2
• Kimi-VL-A3B-Thinking-2506: B
Open-ended Answers:
• GPT-5-mini: A crab is in the top left corner of the plant, perched on the rim of the fishbowl. It’s hanging over the water with its claws visible.
• Gemini-2.5-Flash: The plant in the image is a type of seaweed or coral, and...

Cheetah
MCQ Answers:
• GPT-5-mini: 3
• Gemini-2.5-Flash: 3
• InternVL3.5-72B: 3
• Kimi-VL-A3B-Thinking-2506: 1
Open-ended Answers:
• GPT-5-mini: Directly behind the cheetah is a white open-sided tour vehicle or tram. Several people are seated inside the tram, watching and taking photos.
• Gemini-2.5-Flash: Directly behind the cheetah, there is a white ope...
