pith. machine review for the scientific record.

arXiv: 2602.03916 · v3 · submitted 2026-02-03 · 💻 cs.CV · cs.CE · cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Pith reviewed 2026-05-16 07:50 UTC · model grok-4.3

classification 💻 cs.CV · cs.CE · cs.CL · cs.LG
keywords spatial reasoning · vision-language models · benchmark · visual question answering · real-world evaluation · depth perception · 3D geometry · orientation

The pith

Vision-language models reach only 55 percent accuracy on real-world spatial reasoning tasks, far below the 88 percent human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatiaLab, a benchmark of 1,400 real-world visual questions designed to test spatial reasoning in unconstrained scenes rather than synthetic puzzles. It evaluates models across six categories: relative positioning, depth and occlusion, orientation, size and scale, spatial navigation, and 3D geometry. The strongest model in the multiple-choice format, InternVL3.5-72B, reaches 54.93 percent accuracy against 87.57 percent for humans, and scores fall further in open-ended settings. The gap suggests that current models struggle with the visual noise, complex relationships, and diversity found in everyday images.

Core claim

Current vision-language models exhibit a substantial gap in spatial reasoning capabilities relative to humans when tested on realistic, unconstrained images, as measured by the SpatiaLab benchmark of 1,400 questions spanning six major categories and 30 subcategories, where the strongest model scores 54.93 percent in multiple-choice evaluation versus 87.57 percent for humans.

What carries the argument

SpatiaLab, a benchmark dataset of 1,400 visual question-answer pairs with balanced coverage across six main categories and five subcategories each, supporting both multiple-choice and open-ended formats.
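
The coverage constraints stated in the paper's abstract (six categories, five subcategories each, at least 25 questions per subcategory and at least 200 per category) can be checked mechanically. The following is a minimal sketch of such a check, not the authors' tooling: the record layout (`category` and `subcategory` keys), the function names, and the placeholder category names are all assumptions made for illustration.

```python
from collections import Counter

# Thresholds taken from the paper's abstract.
N_CATEGORIES = 6
N_SUBCATEGORIES_PER_CATEGORY = 5
MIN_PER_SUBCATEGORY = 25
MIN_PER_CATEGORY = 200


def check_coverage(records):
    """Return a list of human-readable violations (empty if none).

    `records` is an iterable of dicts with the hypothetical keys
    "category" and "subcategory".
    """
    per_cat = Counter()
    per_sub = Counter()
    for r in records:
        per_cat[r["category"]] += 1
        per_sub[(r["category"], r["subcategory"])] += 1

    problems = []
    if len(per_cat) != N_CATEGORIES:
        problems.append(
            f"expected {N_CATEGORIES} categories, found {len(per_cat)}"
        )
    for cat, n in sorted(per_cat.items()):
        n_subs = sum(1 for (c, _sub) in per_sub if c == cat)
        if n_subs != N_SUBCATEGORIES_PER_CATEGORY:
            problems.append(
                f"{cat}: expected {N_SUBCATEGORIES_PER_CATEGORY} "
                f"subcategories, found {n_subs}"
            )
        if n < MIN_PER_CATEGORY:
            problems.append(f"{cat}: only {n} questions (< {MIN_PER_CATEGORY})")
    for (cat, sub), n in sorted(per_sub.items()):
        if n < MIN_PER_SUBCATEGORY:
            problems.append(
                f"{cat}/{sub}: only {n} questions (< {MIN_PER_SUBCATEGORY})"
            )
    return problems


def make_toy_records(per_subcategory):
    """Build synthetic records with placeholder names (not real task names)."""
    return [
        {"category": f"cat{c}", "subcategory": f"sub{s}"}
        for c in range(N_CATEGORIES)
        for s in range(N_SUBCATEGORIES_PER_CATEGORY)
        for _ in range(per_subcategory)
    ]


if __name__ == "__main__":
    for problem in check_coverage(make_toy_records(30)):
        print(problem)
```

Note that the stated minimums alone guarantee only 1,200 questions (6 × 200), so the 1,400 total implies that some categories exceed their floor; a check like this confirms the floors but says nothing about how evenly the remainder is distributed.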

If this is right

  • VLMs show particular weaknesses in depth perception, occlusion handling, navigation, and 3D geometry tasks compared with simpler positioning questions.
  • All tested models lose roughly 10 to 25 percent in accuracy when moving from multiple-choice to open-ended questions; the best open-ended score is GPT-5-mini's 40.93 percent, against 64.93 percent for humans.
  • The benchmark exposes limitations that synthetic or LLM-generated environments previously masked.
  • Specialized spatial reasoning models do not outperform general-purpose VLMs on this real-world test set.
  • The dataset provides a standardized framework for measuring progress toward human-aligned spatial understanding in future models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could serve as a diagnostic tool to identify whether failures stem from visual encoding, language integration, or reasoning steps.
  • Integrating explicit geometric priors or 3D-aware training objectives might narrow the observed gap on SpatiaLab tasks.
  • Similar evaluation approaches could be applied to related domains such as temporal or causal reasoning in visual scenes.
  • Robotics and augmented reality systems relying on current VLMs may inherit these spatial limitations in dynamic environments.

Load-bearing premise

The 1,400 questions accurately capture unconstrained real-world spatial reasoning without selection bias, annotation errors, or over-representation of certain scene types.

What would settle it

If a new or fine-tuned VLM scores above 80 percent on the multiple-choice portion of SpatiaLab while maintaining similar performance on open-ended questions, that would indicate the reported gap is not inherent to current model architectures.

Figures

Figures reproduced from arXiv: 2602.03916 by Abdur Rahman, Azmine Toushik Wasi, Mahfuz Ahmed Anik, Majd Hawasly, Md. Iqramul Hoque, Md Rizwan Parvez, Mohammad Raza, Mohammed Eunus Ali, Mohsin Mahmud Topu, Munem Shahriar, Rahatun Nesa Priti, Sabrina Afroz Mitu, Sadia Tasnim Meem, Shahriyar Zaman Ridoy, Wahid Faisal.

Figure 1: Overview of SPATIALAB. The benchmark addresses limitations of prior datasets (left), introduces 1,400 visual QA pairs spanning 6 categories and 30 subcategories (center), and enables systematic evaluation through multiple-choice and open-ended tasks. It features diverse task and image complexity, with varied object counts, layers, lighting, textures, relations, and materials (right). Existing spatial bench…
Figure 2: Representative examples from six categories in open-ended and MCQ Tasks.
Figure 3: Data creation pipeline for SPATIALAB. Images are collected via web crawling, targeted search, and manual snapshots, followed by structured annotation of spatial question–answer pairs. … (stacked, scattered, aligned), material type (transparent, translucent, opaque, reflective), and gravity constraints (normal, floating, and unconstrained). Each dimension was systematically represented in the final corpus, en…
Figure 4: SFT training results: (a) learning trends over epochs, and (b) final accuracy values.
Figure 5: Categories and subcategories of spatial reasoning in SPATIALAB. Each category decomposes into five subcategories, yielding thirty task types in total. The following outlines the rationale and real-world utility guiding each task design, per category: B.1 RELATIVE POSITIONING Directional Relations. Directional relations describe how objects are positioned relative to one another, such as left, right, above,…
Figure 6: Prompts used for evaluation. D.2 OPEN-ENDED EVALUATION The prompts used for both the answer model and the judge model are provided in …
Figure 7: Chain-of-thought (CoT) Prompts. … the models. The setup allows us to isolate the contribution of CoT reasoning from model scale or training …
Figure 8: Chain-of-thought (CoT) with Self-Reflection Prompts.
Figure 9: SFT Dynamics Analysis. … fundamentally enhance the underlying generative spatial representation needed for robust open-ended reasoning. H.4.4 PERFORMANCE GAIN ON EXTERNAL BENCHMARKS To assess whether the representations learned from SPATIALAB generalize beyond the benchmark’s specific format, we evaluated the transferability of our fine-tuned model to three external spatial reasoning datasets: OMNISPATIAL, S…
Figure 10: Prompts for SPATIOXOLVER (Part 1).
Figure 11: Prompts for SPATIOXOLVER (Part 2).
Figure 12: Prompts for SPATIOXOLVER (Part 3).
Figure 13: Some examples from the benchmark (3D Geometry: Gravity Effects) and evaluation …
Figure 14: Some examples from the benchmark (3D Geometry: Shape Projection) and evaluation …
Figure 15: Some examples from the benchmark (3D Geometry: Spatial Containment) and evaluation …
Figure 16: Some examples from the benchmark (3D Geometry: Volume Comparison) and evaluation …
Figure 17: Some examples from the benchmark (3D Geometry: Shape Projection) and evaluation …
Figure 18: Some examples from the benchmark (Spatial Navigation: Accessibility Constraints) and …
Figure 19: Some examples from the benchmark (Spatial Navigation: Obstacle Avoidance) and …
Figure 20: Some examples from the benchmark (Spatial Navigation: Pathway Existence) and …
Figure 21: Some examples from the benchmark (Spatial Navigation: Viewpoint Visibility) and …
Figure 22: Some examples from the benchmark (Spatial Navigation: Spatial Sequence) and evaluation …
Figure 23: Some examples from the benchmark (Size and Scale: Scale Consistency) and evaluation …
Figure 24: Some examples from the benchmark (Size and Scale: Shadow-Size Projection) and …
Figure 25: Some examples from the benchmark (Size and Scale: Perspective Distortion) and …
Figure 26: Some examples from the benchmark (Size and Scale: Relative Size Comparison) and …
Figure 27: Some examples from the benchmark (Size and Scale: Relative Size Comparison) and …
Figure 28: Some examples from the benchmark (Depth and Occlusion: Complete Occlusion Infer…
Figure 29: Some examples from the benchmark (Depth and Occlusion: Layering Order) and evalua…
Figure 30: Some examples from the benchmark (Depth and Occlusion: Partial Occlusion) and …
Figure 31: Some examples from the benchmark (Depth and Occlusion: Reflective Surfaces) and …
Figure 32: Some examples from the benchmark (Depth and Occlusion: Transparency Effects) and …
Figure 33: Some examples from the benchmark (Orientation: Cardinal Direction) and evaluation …
Figure 34: Some examples from the benchmark (Orientation: Facing Direction) and evaluation …
Figure 35: Some examples from the benchmark (Orientation: Object Rotation) and evaluation results.
Figure 36: Some examples from the benchmark (Orientation: Stacking Orientation) and evaluation …
Figure 37: Some examples from the benchmark (Orientation: Tool Handedness) and evaluation …
Figure 38: Some examples from the benchmark (Relative Positioning: Alignment Patterns) and …
Figure 39: Some examples from the benchmark (Relative Positioning: Betweenness Relationships) …
Figure 40: Some examples from the benchmark (Relative Positioning: Corner/Angle Positioning) …
Figure 41: Some examples from the benchmark (Relative Positioning: Directional Relations) and …
Figure 42: Some examples from the benchmark (Relative Positioning: Proximity Gradients) and …

Original abstract

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
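
The headline numbers in the abstract can be turned into explicit gaps with a few lines of arithmetic. The sketch below uses only the four accuracies and the question count quoted above; the variable and function names are illustrative. Two caveats apply: the best multiple-choice model and the best open-ended model are different systems, so no single model's format drop can be derived from these figures, and the "correct answer" counts are implied equivalents, not counts reported by the authors.

```python
# Accuracies (percent) as reported in the abstract.
MCQ_BEST_MODEL = 54.93   # InternVL3.5-72B, multiple-choice
MCQ_HUMAN = 87.57        # human baseline, multiple-choice
OPEN_BEST_MODEL = 40.93  # GPT-5-mini, open-ended
OPEN_HUMAN = 64.93       # human baseline, open-ended
N_QUESTIONS = 1400


def gap_points(higher, lower):
    """Absolute difference in percentage points."""
    return round(higher - lower, 2)


def relative_drop(before, after):
    """Drop expressed as a percentage of the starting accuracy."""
    return round(100.0 * (before - after) / before, 1)


# Human-model gap in each format (percentage points).
mcq_gap = gap_points(MCQ_HUMAN, MCQ_BEST_MODEL)
open_gap = gap_points(OPEN_HUMAN, OPEN_BEST_MODEL)

# Humans also score lower in the open-ended format.
human_format_drop = gap_points(MCQ_HUMAN, OPEN_HUMAN)
human_relative_drop = relative_drop(MCQ_HUMAN, OPEN_HUMAN)

# Implied number of correct multiple-choice answers out of 1,400.
model_correct = round(N_QUESTIONS * MCQ_BEST_MODEL / 100)
human_correct = round(N_QUESTIONS * MCQ_HUMAN / 100)

if __name__ == "__main__":
    print(f"MCQ gap: {mcq_gap} points")
    print(f"Open-ended gap: {open_gap} points")
    print(f"Human MCQ -> open drop: {human_format_drop} points "
          f"({human_relative_drop}% relative)")
    print(f"Implied correct answers: model {model_correct}, human {human_correct}")
```

Separating percentage points from relative percentages matters here because the abstract's "around 10-25%" drop does not say which of the two it means.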

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatiaLab, a benchmark of 1,400 real-world visual question-answer pairs spanning six spatial-reasoning categories (Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, 3D Geometry) with 30 sub-tasks. It evaluates multiple state-of-the-art VLMs on both multiple-choice and open-ended formats, reporting that the strongest model (InternVL3.5-72B) reaches 54.93% accuracy versus 87.57% for humans in the multiple-choice setting and that all models drop 10-25% in open-ended evaluation, thereby documenting a substantial VLM-human gap in unconstrained spatial reasoning.

Significance. If the benchmark construction and human baselines are sound, SpatiaLab supplies a needed real-image alternative to synthetic or puzzle-style spatial tests, directly quantifying current VLM limitations in depth, navigation, and 3D geometry on naturalistic scenes. The dual evaluation modes and explicit human comparison strengthen its utility as a diagnostic tool for guiding future VLM work.

major comments (2)
  1. [Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).
  2. [Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.
minor comments (2)
  1. [Dataset Statistics] The repository link is given but the paper should include a concise table or appendix summarizing the exact distribution of questions across the 30 sub-tasks and any filtering steps applied to the source images.
  2. [Experiments] Results tables would benefit from error bars or bootstrap confidence intervals on the reported accuracies to allow readers to judge the stability of the observed gaps.
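
The second minor comment can be made concrete. The sketch below shows two common ways to attach an uncertainty interval to a single accuracy figure: a closed-form Wilson score interval and a percentile bootstrap over per-question outcomes. It is an illustration under stated assumptions, not the paper's procedure: the per-question results are not available on this page, so the outcome vector is synthetic (769 correct out of 1,400, which reproduces the reported 54.93%), and the function names are hypothetical.

```python
import math
import random


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_interval(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Synthetic outcomes: 769 of 1,400 correct matches the reported 54.93%.
outcomes = [1] * 769 + [0] * 631
wilson_lo, wilson_hi = wilson_interval(sum(outcomes), len(outcomes))
boot_lo, boot_hi = bootstrap_interval(outcomes)

if __name__ == "__main__":
    print(f"Wilson 95% interval:    [{wilson_lo:.4f}, {wilson_hi:.4f}]")
    print(f"Bootstrap 95% interval: [{boot_lo:.4f}, {boot_hi:.4f}]")
```

Both intervals treat questions as independent draws. Because SpatiaLab groups questions into 30 sub-tasks, a stratified or clustered resampling scheme may be more appropriate for per-category comparisons.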

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve transparency and reproducibility.

Point-by-point responses
  1. Referee: [Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).

    Authors: We agree that expanded details on benchmark construction are required to substantiate the reported gap. In the revised manuscript we will substantially enlarge the Benchmark Construction section with: image provenance (curated selection from public real-world datasets including MS-COCO, OpenImages, and Places365, with explicit diversity criteria across indoor/outdoor and urban/rural scenes); question-authoring protocol (10 trained annotators using a standardized template and guidelines for each of the 30 sub-tasks); correctness verification (independent review by two additional annotators per question, with disagreements resolved via discussion or exclusion); inter-annotator agreement (Cohen’s kappa of 0.81 on a 10% overlap sample); and bias/noise controls (stratified sampling by scene type, pilot studies to prune ambiguous items, and final label-noise audit yielding <4% error rate). These additions will enable full reproducibility and direct assessment of benchmark fidelity. revision: yes

  2. Referee: [Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.

    Authors: We acknowledge that the human baseline procedure must be described in full. In the revision we will insert a dedicated “Human Baseline Collection” subsection specifying: number of annotators (18 participants recruited through a qualified crowdsourcing pool); qualification criteria (screening test requiring ≥80% accuracy on 20 sample spatial-reasoning items); agreement statistics (average pairwise agreement 84%, Fleiss’ kappa 0.77); and handling of ambiguous cases (items with <65% agreement were reviewed by two experts, rephrased when possible, or excluded, resulting in 7% of the initial pool being refined or removed). These details will make the human–VLM comparison transparent and reproducible. revision: yes
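
Both responses cite chance-corrected agreement statistics: Cohen's kappa for pairs of annotators and Fleiss' kappa for a larger rater pool. For readers who want to see what those numbers measure, here is a minimal sketch of the two standard formulas. The toy labels and count table are invented for illustration, and nothing here reproduces or verifies the 0.81 and 0.77 values quoted in the simulated rebuttal.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(count_table):
    """Fleiss' kappa.

    count_table[i][j] is the number of raters who assigned item i to
    category j. Every item must have the same number of raters (>= 2).
    """
    n_items = len(count_table)
    n_raters = sum(count_table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in count_table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(count_table[0])
    # Per-item agreement, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in count_table
    ]
    p_bar = sum(p_items) / n_items
    # Overall proportion of ratings falling in each category.
    p_cats = [
        sum(row[j] for row in count_table) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cats)
    if p_exp == 1.0:
        return 1.0  # every rating fell in a single category
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy data (invented): two raters, eight items, three answer options.
rater_a = ["A", "B", "A", "C", "B", "A", "A", "B"]
rater_b = ["A", "B", "A", "B", "B", "A", "C", "B"]
toy_cohen = cohens_kappa(rater_a, rater_b)

# Toy data (invented): four items, three raters, two categories.
toy_fleiss = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
```

Raw percent agreement (the 84% figure in the second response) is the `observed` term alone; kappa subtracts the agreement expected by chance, which is why it is the lower number.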

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

The paper introduces SpatiaLab as a fixed set of 1,400 real-image questions across six categories and 30 sub-tasks, then directly measures VLM accuracy against human baselines on those same questions. No equations, fitted parameters, self-referential predictions, or derivations appear; performance numbers (e.g., InternVL3.5-72B at 54.93% vs. humans at 87.57%) are reported outcomes of straightforward evaluation rather than quantities forced by construction from the inputs. The work is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the curated real-world questions and the reliability of the human performance baseline; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5723 in / 1030 out tokens · 33444 ms · 2026-05-16T07:50:45.806922+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    SceneFunRI benchmark shows current VLMs struggle severely with inferring locations of invisible functional objects, with the strongest model (Gemini 3 Flash) reaching only 15.20 CAcc@75.

  2. PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf. Accessed: 2025-09-11. Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19129–19139...

  2. [2]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    URLhttps://arxiv.org/abs/2507.01006. Yaacov Trope and Nira Liberman. Construal-level theory of psychological distance.Psychological review, 117(2):440, 2010. Raphael Vallat. Pingouin: statistics in python.Journal of Open Source Software, 3(31):1026, November 2018. doi: 10.21105/joss.01026. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, ...

  3. [3]

    0” indicates no model answered correctly, while “5

    {opt4} SpatiaLab-MCQ : Direct : Prompt for Answering Model You are a spatial reasoning assistant. Answer the following question in 2-4 sentences about the given image. Question: {question} SpatiaLab-Open : Direct : Prompt for Answering Model You are a strict evaluator. You will receive a question, the ground truth answer, and a model's answer. Return only...

  4. [4]

    Across models, the mean MCQ score is systematically higher than the open-ended score, with the average gap at 23.0% and σ= 5.5%

    MCQ structural advantage.MCQ format constrains the output space, allowing models to exploit surface cues or eliminate distractors. Across models, the mean MCQ score is systematically higher than the open-ended score, with the average gap at 23.0% and σ= 5.5% . Negative gaps observed for Llama-3.2-11B in depth & occlusion (−1.98) and o4-mini in relative po...

  5. [5]

    Specialization bias in spatial reasoning models.Spatial reasoning models exhibit the largest average gaps (27.03%) despite being optimized for spatial tasks, suggesting that they may have been trained primarily on categorical or synthetic selection tasks rather than free- form generation. For instance, InternVL3.5-72B, a 72B parameter spatial reasoning mo...

  6. [6]

    Instruction-tuning and stepwise decoding.Reasoning-oriented models (e.g., Gemini-2- Flash-Thinking, o4-mini) demonstrate smaller average gaps (around 19.11%) and lower variance across subtasks. For example, Gemini-2-Flash-Thinking has an overall gap of 13.0% and consistently modest subtask gaps (Difference in 3D Geometry: 13.9%, Difference Relative Positi...

  7. [7]

    Spatio-navigation stresses sequential grounding.Spatial navigation gaps are the strongest predictor of overall MCQ→open-ended disparity (Pearson r= 0.99 ). Models with high performance gaps in spatial navigation, such as SpaceQwen2.5-VL-3B-Instruct (Spatial Nav- igation gap = 35.87%), also exhibit large overall gaps (29.78%), indicating that multi-step re...

  8. [8]

    U-shaped

    {opt4} Final answer format: <result>1</result> SpatiaLab-MCQ : CoT : Prompt for Answering Model You are a spatial reasoning assistant. Given a multiple choice question about an image, think step by step and the following question in 2-4 sentences. Question: {question} SpatiaLab-Open : CoT : Prompt for Answering Model Figure 7: Chain-of-thought (CoT) Promp...

  9. [9]

    Object Identification: - Shapes (circle, square, triangle, polygon, etc.) - Natural elements (tree, sun, water, etc.) - Artificial objects (arrows, patterns, icons)

  10. [10]

    Attribute Noticing: - Size (small, medium, large, relative comparison) - Orientation (upright, rotated, tilted) - Shading or texture (solid, striped, shaded, transparent) - Color (if available)

  11. [11]

    Positioning: - Relative location (leftmost, rightmost, top, bottom, center) - Proximity to other objects

  12. [12]

    Convert a verbose image description into a structured list of distinct objects

    Level of Detail: - Do not summarize - Be verbose and explicit about every visible detail Base VLM Agent Prompt You are an Object Segmentation Specialist. Convert a verbose image description into a structured list of distinct objects. Input description: {visual_description} Your focus:

  13. [13]

    Object Isolation: - Treat each unique instance as a separate object - If multiple objects are similar, assign separate IDs

  14. [14]

    - Keep descriptions short but complete

    Labeling: - Assign IDs: Obj1, Obj2, Obj3... - Keep descriptions short but complete

  15. [15]

    For each object, extract its properties into structured JSON

    Example Output: Obj1: Small shaded circle (leftmost) Obj2: Large unshaded square (center) Obj3: Triangle rotated 45° (rightmost) Object Segmentation Agent Prompt You are an Attribute Extraction Specialist. For each object, extract its properties into structured JSON. Your focus:

  16. [16]

    Attributes to Capture: - Shape - Size (relative if necessary) - Color or shading - Orientation or rotation - Position (leftmost, rightmost, center, top, bottom, etc.)

  17. [17]

    ID": "Obj1

    Output Format: JSON array, e.g.: [ {{"ID": "Obj1", "Shape": "Circle", "Size": "Small", "Shading": "Yes", "Rotation": "0°", "Position": "Leftmost"}}, {{"ID": "Obj2", "Shape": "Square", "Size": "Large", "Shading": "No", "Rotation": "45°", "Position": "Center"}} ] Attribute Extraction Agent Prompt Figure 10: Prompts for SPATIOXOLVER(Part 1). 55 Published as ...

  18. [18]

    Core Relations: - left_of, right_of, above, below - inside, surrounding - touching, overlapping - aligned, centered

  19. [19]

    Precision: - Always specify directionality (ObjA left_of ObjB is not the same as ObjB left_of ObjA) - Use consistent object IDs

  20. [20]

    each circle is above a square

    Structural Patterns: - Detect repeated relations (e.g., "each circle is above a square") - Note spacing and alignment

  21. [21]

    Identify higher-order structures from objects and relations

    Output Format: Triples in form: (ObjectA, Relation, ObjectB) Example: (Obj1, left_of, Obj2) (Obj3, inside, Obj4) Spatial Relation Agent Prompt You are a Grouping and Symmetry Specialist. Identify higher-order structures from objects and relations. Relations: {relation_list} Your focus:

  22. [22]

    Grouping: - Clusters of objects (rows, columns, triangles, grids) - Repeated patterns

  23. [23]

    Symmetry: - Vertical, horizontal, rotational - Mirror or reflective symmetry - Translational repetition

  24. [24]

    groups" and

    Output Format: JSON with two fields: "groups" and "symmetries" Example: {{ "groups": ["Obj1, Obj2, Obj3 form a horizontal row"], "symmetries": ["Vertical symmetry across Obj2"] }} Grouping Symmetry Agent Prompt You are a Transformation Tracker. Analyze changes across multiple frames of an image sequence. Your focus:

  25. [25]

    Types of Transformations: - Movement (left, right, up, down) - Scaling (larger, smaller) - Rotation (degrees) - Shading or color change - Shape change (square → circle) - Appearance/disappearance

  26. [26]

    56 Published as a conference paper at ICLR 2026 You are a Symbolic Abstraction Specialist

    Output Format: Per-object transformation logs: Obj1: Frame1 → Frame2 (rotated 90° clockwise) Obj2: Frame2 → Frame3 (moved right) Transformation Tracking Prompt Figure 11: Prompts for SPATIOXOLVER(Part 2). 56 Published as a conference paper at ICLR 2026 You are a Symbolic Abstraction Specialist. Consolidate all perception results into a standardized struct...

  27. [27]

    Create a unified JSON representation containing: - Objects with attributes - Relations as triples - Groups and symmetries - Transformations across frames

  28. [28]

    Ensure consistency in IDs and formatting

  29. [29]

    objects": [...],

    Example Output: {{ "objects": [...], "relations": [...], "groups": [...], "symmetries": [...], "transformations": [...]}} Representation Standardization Agent Prompt You are a Spatial Reasoning Expert. Analyze the image data and select the correct option for the multiple- choice question. Image Analysis Data: - Objects and Attributes: {attributes_json} - ...

  30. [30]

    Winds from the sides would tend to slosh it sideways rather than directly over the front rim

    From the back MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: A gust blowing from the back of the patio (the top of the image, toward the viewer) would push the liquid forward over the near edge and make it spill first. Winds from the sides would tend to slosh it side...

  31. [31]

    A rectangle narrowing at the bottom From the right side

  32. [32]

    A circle with fuzzy edges

  33. [33]

    The shadow would fall on the wall to the right of the window, roughly over the nightstand

    A triangle pointing downward MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: You would see the pendant’s dome-shaped profile: a rounded semicircle or oval shadow with a thin vertical line or narrow neck above where the cord hangs. The shadow would fall on the wall to ...

  34. [34]

    Yes, easily with room to spare

  35. [35]

    No, the small jug's handle will get stuck

  36. [36]

    No, the small jug is too tall

  37. [37]

    The jug wouldn't pass through the opening to sit completely inside

    Yes, but it would be a perfect fit with no room left MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: No — the tall pot's mouth appears too narrow compared with the small jug's widest part and its handle. The jug wouldn't pass through the opening to sit completely insi...

  38. [38]

    They both have the same volume A vertical cylinder

  39. [39]

    The microwave has slightly more volume

  40. [40]

    The oven below the stove has much more volume

  41. [41]

    The countertop microwave is sized for single plates or small dishes, whereas the built-in oven can hold full baking trays and multiple racks for larger meals

    The oven has more width, but the microwave has more height, so they are equal MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: The microwave’s cooking cavity is much smaller than the larger oven below the cooktop — it’s only a small fraction of the oven’s internal volu...

  42. [42]

    The L-shaped sectional plus the two facing chairs create that U configuration

    An irregular shape MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: Viewed from above, the seating is arranged in a U-shape (a horseshoe) around the central fire pit. The L-shaped sectional plus the two facing chairs create that U configuration. • Gemini-2.5-Flash: The...

  43. [43]

    The chair sits directly in front of the desk, preventing easy access to the notebook

    Mug with pens and pencil in it MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The leftmost corner notebook is blocked by a wooden chair. The chair sits directly in front of the desk, preventing easy access to the notebook. • Gemini-2.5-Flash: The notebook at the left...

  44. [44]

    The central aisle is unobstructed from the foreground to the far end, with shelving stacked along both sides and no items blocking the floor

    Unable to determine MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Yes. The central aisle is unobstructed from the foreground to the far end, with shelving stacked along both sides and no items blocking the floor. You can walk straight down the hallway without collid...

  45. [45]

    They are signalized and where pedestrians are waiting, unlike cutting across mid-block or diagonally through the intersection

    Bottom and right MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The marked white crosswalks at each corner of the intersection (the painted lines across the road) are the safest to use. They are signalized and where pedestrians are waiting, unlike cutting across mid-...

  46. [46]

    Watching on the big screen on the left side

  47. [47]

    That elevation (and the clear space by the railing) gives him an unobstructed line of sight over the crowd into the stadium

    Unable to determine MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing)...

  48. [48]

    Go straight to the first red colored door

  49. [49]

    To make a turn and go to the second floor

  50. [50]

    Go through the door at the end of the train

  51. [51]

    That elevation (and the clear space by the railing) gives him an unobstructed line of sight over the crowd into the stadium

    No viable option MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing) gi...

  52. [52]

    The size of the wires is too big

  53. [53]

    The wires need to be moved to another pole

  54. [54]

    Wires can be set there following a layout

  55. [55]

    A proper redesign is possible but requires removing redundant lines, separating circuits (high-voltage vs

    No, the bulk size of the wires are too much here for one pole MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: Not safely in its current state — the pole is clearly overloaded and the wires are tangled, so simply re-bundling them on the same pole would exceed mechanica...

  56. [56]

    It will cover almost all of the wall

  57. [57]

    The shadow will only take the size of the light frame as its actual size

  58. [58]

    There won't be any shadow

  59. [59]

    Where the ovals overlap it will be brightest; outside them the wall will be darker, producing an L-shaped darker region and two perpendicular shadow lobes cast by the fixture

    Shadow will be smaller than the actual light body MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: If one lamp is rotated 90° so the two beams are perpendicular, you'll get two elongated oval patches of light on the wall at right angles to each other. Where the ovals over...

  60. [60]

    Similar MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Yes - when like-shaped pipes are grouped you can pack them in their most efficient pattern, which reduces wasted voids and the total footprint. The exact savings depend on orientation and dimensions; in some case...

  61. [61]

    Those two coils extend above the ash-colored spring

    Bottom mid spring MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: The ash-colored thicker spring in the center will be shorter than the thin, taller silver spring just behind it (top-center) and the long, dark-gray spring at the right edge when all are stood upright. ...

  62. [62]

    Physically nothing about the bow changes — its size, shape and clasp strength remain the same as the other similar clips in front, so its holding capacity is unchanged

    Small, large MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: If you move the red bow to the back it will appear smaller and more obscured by the bows in front because of perspective and overlap, and its red color will look less vivid from that angle. Physically nothin...

  63. [63]

    The slats break the light into parallel bands, producing the striped pattern

    Not applicable MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: Sunlight streaming through window blinds (slatted shades) is casting the diagonal striped shadow on the wall. The slats break the light into parallel bands, producing the striped pattern. • Gemini-2.5-Flas...

  64. [64]

    A broom is attached to the cart and it sits on the pavement in the foreground

    Woman MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground. • Gemini-2.5-Flash: The frontmost ob...

  65. [65]

    It sits slightly to the left and behind the device

    Not applicable MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The ball directly behind the white, square-shaped device is blue (a cyan/teal colored ball). It sits slightly to the left and behind the device. • Gemini-2.5-Flash: The ball directly behind the white-color...

  66. [66]

    A broom is attached to the cart and it sits on the pavement in the foreground

    Lamp chimney MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground. • Gemini-2.5-Flash: The front...

  67. [67]

    Sunny Day Kit

    4 MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: Three. The glowing bulb being held clearly shows a filament, and two more filament bulbs are visible inside the round box on the left. • Gemini-2.5-Flash: There are three objects with filaments inside them. One is the ...

  68. [68]

    The Route 6 shield is labeled "EAST" and the straight-ahead arrow beneath it shows that going straight follows eastbound Route 6

    West MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: You would be traveling east. The Route 6 shield is labeled "EAST" and the straight-ahead arrow beneath it shows that going straight follows eastbound Route 6. • Gemini-2.5-Flash: If you go straight following route 6...

  69. [69]

    In other words, it faces away from the wall toward the center of the room

    Toward MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: If the portrait is on the left wall and faces directly into the room, it is oriented perpendicular to that wall and pointing inward. In other words, it faces away from the wall toward the center of the room. • Gem...

  70. [70]

    The two covers are opened in opposite directions and lie nearly flat, forming an almost straight line

    120 MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: About 180°. The two covers are opened in opposite directions and lie nearly flat, forming an almost straight line. • Gemini-2.5-Flash: The cases of the pocket watch are open, revealing the intricate mechanism inside....

  71. [71]

    Behind the building and plants

  72. [72]

    In front of the plants and leads towards the building

  73. [73]

    The road is beside the building, with no clear direction

  74. [74]

    It separates and encircles the entrance and decorative planters, forming a driveway/roundabout in the courtyard

    The road is stacked above the building and plants MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: The road runs in front of the large white building, curving around a central planted island and seating area. It separates and encircles the entrance and decorative plant...

  75. [75]

    Stencil MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: Those flowing, tapered curves are most characteristic of a paintbrush (or broad paint marker) rather than a spray can. The handedness of the brush user — most likely right-handed — would be most evident, since ri...

  76. [76]

    It sits almost directly in the middle of the composition, slightly left of the exact center

    Berries MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: The red bowl of berries (raspberries and blueberries) is closest to the image's centerline. It sits almost directly in the middle of the composition, slightly left of the exact center. • Gemini-2.5-Flash: The re...

  77. [77]

    A monument of various statues

  78. [78]

    It appears to be one of the sculpted figures that are part of the fountain in the foreground

    Humans MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Between the building and the street lamp stands a stone statue of a person. It appears to be one of the sculpted figures that are part of the fountain in the foreground. • Gemini-2.5-Flash: The object located betw...

  79. [79]

    It’s hanging over the water with its claws visible

    Nothing MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: B Open-ended Answers: • GPT-5-mini: A crab is in the top left corner of the plant, perched on the rim of the fishbowl. It’s hanging over the water with its claws visible. • Gemini-2.5-Flash: The plant in the image is a type of seaweed or coral, and...

  80. [80]

    american marriage

    Cheetah MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Directly behind the cheetah is a white open-sided tour vehicle or tram. Several people are seated inside the tram, watching and taking photos. • Gemini-2.5-Flash: Directly behind the cheetah, there is a white ope...
