Recognition: 2 theorem links
· Lean TheoremSpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Pith reviewed 2026-05-16 07:50 UTC · model grok-4.3
The pith
Vision-language models reach only 55 percent accuracy on real-world spatial reasoning tasks, far below the 88 percent human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Current vision-language models exhibit a substantial gap in spatial reasoning capabilities relative to humans when tested on realistic, unconstrained images, as measured by the SpatiaLab benchmark of 1,400 questions spanning six major categories and 30 subcategories, where the strongest model scores 54.93 percent in multiple-choice evaluation versus 87.57 percent for humans.
What carries the argument
SpatiaLab, a benchmark dataset of 1,400 visual question-answer pairs with balanced coverage across six main categories and five subcategories each, supporting both multiple-choice and open-ended formats.
If this is right
- VLMs show particular weaknesses in depth perception, occlusion handling, navigation, and 3D geometry tasks compared with simpler positioning questions.
- All tested models experience a 10 to 25 percent performance drop when moving from multiple-choice to open-ended question formats.
- The benchmark exposes limitations that synthetic or LLM-generated environments previously masked.
- Specialized spatial reasoning models do not outperform general-purpose VLMs on this real-world test set.
- The dataset provides a standardized framework for measuring progress toward human-aligned spatial understanding in future models.
Where Pith is reading between the lines
- The benchmark could serve as a diagnostic tool to identify whether failures stem from visual encoding, language integration, or reasoning steps.
- Integrating explicit geometric priors or 3D-aware training objectives might narrow the observed gap on SpatiaLab tasks.
- Similar evaluation approaches could be applied to related domains such as temporal or causal reasoning in visual scenes.
- Robotics and augmented reality systems relying on current VLMs may inherit these spatial limitations in dynamic environments.
Load-bearing premise
The 1,400 questions accurately capture unconstrained real-world spatial reasoning without selection bias, annotation errors, or over-representation of certain scene types.
What would settle it
If a new or fine-tuned VLM scores above 80 percent on the multiple-choice portion of SpatiaLab while maintaining similar performance on open-ended questions, that would indicate the reported gap is not inherent to current model architectures.
Figures
read the original abstract
Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpatiaLab, a benchmark of 1,400 real-world visual question-answer pairs spanning six spatial-reasoning categories (Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, 3D Geometry) with 30 sub-tasks. It evaluates multiple state-of-the-art VLMs on both multiple-choice and open-ended formats, reporting that the strongest model (InternVL3.5-72B) reaches 54.93% accuracy versus 87.57% for humans in the multiple-choice setting and that all models drop 10-25% in open-ended evaluation, thereby documenting a substantial VLM-human gap in unconstrained spatial reasoning.
Significance. If the benchmark construction and human baselines are sound, SpatiaLab supplies a needed real-image alternative to synthetic or puzzle-style spatial tests, directly quantifying current VLM limitations in depth, navigation, and 3D geometry on naturalistic scenes. The dual evaluation modes and explicit human comparison strengthen its utility as a diagnostic tool for guiding future VLM work.
major comments (2)
- [Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).
- [Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.
minor comments (2)
- [Dataset Statistics] The repository link is given but the paper should include a concise table or appendix summarizing the exact distribution of questions across the 30 sub-tasks and any filtering steps applied to the source images.
- [Experiments] Results tables would benefit from error bars or bootstrap confidence intervals on the reported accuracies to allow readers to judge the stability of the observed gaps.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve transparency and reproducibility.
read point-by-point responses
-
Referee: [Benchmark Construction] The central claim of a substantial performance gap (e.g., 54.93% vs. 87.57%) is load-bearing on the fidelity of the 1,400 questions. The manuscript supplies only high-level category counts and minimum per-subcategory sizes; it does not describe image provenance, question-authoring protocol, correctness verification, inter-annotator agreement, or explicit controls against scene-type bias or label noise (Benchmark Construction section).
Authors: We agree that expanded details on benchmark construction are required to substantiate the reported gap. In the revised manuscript we will substantially enlarge the Benchmark Construction section with: image provenance (curated selection from public real-world datasets including MS-COCO, OpenImages, and Places365, with explicit diversity criteria across indoor/outdoor and urban/rural scenes); question-authoring protocol (10 trained annotators using a standardized template and guidelines for each of the 30 sub-tasks); correctness verification (independent review by two additional annotators per question, with disagreements resolved via discussion or exclusion); inter-annotator agreement (Cohen’s kappa of 0.81 on a 10% overlap sample); and bias/noise controls (stratified sampling by scene type, pilot studies to prune ambiguous items, and final label-noise audit yielding <4% error rate). These additions will enable full reproducibility and direct assessment of benchmark fidelity. revision: yes
-
Referee: [Human Evaluation] Human baseline collection is not detailed (number of annotators, qualification criteria, agreement statistics, or handling of ambiguous cases), yet the headline gap is computed directly against these numbers. Without this information the comparison cannot be reproduced or assessed for bias.
Authors: We acknowledge that the human baseline procedure must be described in full. In the revision we will insert a dedicated “Human Baseline Collection” subsection specifying: number of annotators (18 participants recruited through a qualified crowdsourcing pool); qualification criteria (screening test requiring ≥80% accuracy on 20 sample spatial-reasoning items); agreement statistics (average pairwise agreement 84%, Fleiss’ kappa 0.77); and handling of ambiguous cases (items with <65% agreement were reviewed by two experts, rephrased when possible, or excluded, resulting in 7% of the initial pool being refined or removed). These details will make the human–VLM comparison transparent and reproducible. revision: yes
Circularity Check
No circularity: purely empirical benchmark evaluation
full rationale
The paper introduces SpatiaLab as a fixed set of 1,400 real-image questions across six categories and 30 sub-tasks, then directly measures VLM accuracy against human baselines on those same questions. No equations, fitted parameters, self-referential predictions, or derivations appear; performance numbers (e.g., InternVL3.5-72B at 54.93% vs. humans at 87.57%) are reported outcomes of straightforward evaluation rather than quantities forced by construction from the inputs. The work is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
SceneFunRI benchmark shows current VLMs struggle severely with inferring locations of invisible functional objects, with the strongest model (Gemini 3 Flash) reaching only 15.20 CAcc@75.
-
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
Reference graph
Works this paper leans on
-
[1]
URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf. Accessed: 2025-09-11. Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. Scanqa: 3d question answering for spatial scene understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19129–19139...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/bf02310555 2025
-
[2]
URLhttps://arxiv.org/abs/2507.01006. Yaacov Trope and Nira Liberman. Construal-level theory of psychological distance.Psychological review, 117(2):440, 2010. Raphael Vallat. Pingouin: statistics in python.Journal of Open Source Software, 3(31):1026, November 2018. doi: 10.21105/joss.01026. Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.21105/joss.01026 2010
-
[3]
0” indicates no model answered correctly, while “5
{opt4} SpatiaLab-MCQ : Direct : Prompt for Answering Model You are a spatial reasoning assistant. Answer the following question in 2-4 sentences about the given image. Question: {question} SpatiaLab-Open : Direct : Prompt for Answering Model You are a strict evaluator. You will receive a question, the ground truth answer, and a model's answer. Return only...
-
[4]
MCQ structural advantage.MCQ format constrains the output space, allowing models to exploit surface cues or eliminate distractors. Across models, the mean MCQ score is systematically higher than the open-ended score, with the average gap at 23.0% and σ= 5.5% . Negative gaps observed for Llama-3.2-11B in depth & occlusion (−1.98) and o4-mini in relative po...
-
[5]
Specialization bias in spatial reasoning models.Spatial reasoning models exhibit the largest average gaps (27.03%) despite being optimized for spatial tasks, suggesting that they may have been trained primarily on categorical or synthetic selection tasks rather than free- form generation. For instance, InternVL3.5-72B, a 72B parameter spatial reasoning mo...
-
[6]
Instruction-tuning and stepwise decoding.Reasoning-oriented models (e.g., Gemini-2- Flash-Thinking, o4-mini) demonstrate smaller average gaps (around 19.11%) and lower variance across subtasks. For example, Gemini-2-Flash-Thinking has an overall gap of 13.0% and consistently modest subtask gaps (Difference in 3D Geometry: 13.9%, Difference Relative Positi...
-
[7]
Spatio-navigation stresses sequential grounding.Spatial navigation gaps are the strongest predictor of overall MCQ→open-ended disparity (Pearson r= 0.99 ). Models with high performance gaps in spatial navigation, such as SpaceQwen2.5-VL-3B-Instruct (Spatial Nav- igation gap = 35.87%), also exhibit large overall gaps (29.78%), indicating that multi-step re...
work page 2026
-
[8]
{opt4} Final answer format: <result>1</result> SpatiaLab-MCQ : CoT : Prompt for Answering Model You are a spatial reasoning assistant. Given a multiple choice question about an image, think step by step and the following question in 2-4 sentences. Question: {question} SpatiaLab-Open : CoT : Prompt for Answering Model Figure 7: Chain-of-thought (CoT) Promp...
work page 2026
-
[9]
Object Identification: - Shapes (circle, square, triangle, polygon, etc.) - Natural elements (tree, sun, water, etc.) - Artificial objects (arrows, patterns, icons)
-
[10]
Attribute Noticing: - Size (small, medium, large, relative comparison) - Orientation (upright, rotated, tilted) - Shading or texture (solid, striped, shaded, transparent) - Color (if available)
-
[11]
Positioning: - Relative location (leftmost, rightmost, top, bottom, center) - Proximity to other objects
-
[12]
Convert a verbose image description into a structured list of distinct objects
Level of Detail: - Do not summarize - Be verbose and explicit about every visible detail Base VLM Agent Prompt You are an Object Segmentation Specialist. Convert a verbose image description into a structured list of distinct objects. Input description: {visual_description} Your focus:
-
[13]
Object Isolation: - Treat each unique instance as a separate object - If multiple objects are similar, assign separate IDs
-
[14]
- Keep descriptions short but complete
Labeling: - Assign IDs: Obj1, Obj2, Obj3... - Keep descriptions short but complete
-
[15]
For each object, extract its properties into structured JSON
Example Output: Obj1: Small shaded circle (leftmost) Obj2: Large unshaded square (center) Obj3: Triangle rotated 45° (rightmost) Object Segmentation Agent Prompt You are an Attribute Extraction Specialist. For each object, extract its properties into structured JSON. Your focus:
-
[16]
Attributes to Capture: - Shape - Size (relative if necessary) - Color or shading - Orientation or rotation - Position (leftmost, rightmost, center, top, bottom, etc.)
-
[17]
Output Format: JSON array, e.g.: [ {{"ID": "Obj1", "Shape": "Circle", "Size": "Small", "Shading": "Yes", "Rotation": "0°", "Position": "Leftmost"}}, {{"ID": "Obj2", "Shape": "Square", "Size": "Large", "Shading": "No", "Rotation": "45°", "Position": "Center"}} ] Attribute Extraction Agent Prompt Figure 10: Prompts for SPATIOXOLVER(Part 1). 55 Published as ...
work page 2026
-
[18]
Core Relations: - left_of, right_of, above, below - inside, surrounding - touching, overlapping - aligned, centered
-
[19]
Precision: - Always specify directionality (ObjA left_of ObjB is not the same as ObjB left_of ObjA) - Use consistent object IDs
-
[20]
Structural Patterns: - Detect repeated relations (e.g., "each circle is above a square") - Note spacing and alignment
-
[21]
Identify higher-order structures from objects and relations
Output Format: Triples in form: (ObjectA, Relation, ObjectB) Example: (Obj1, left_of, Obj2) (Obj3, inside, Obj4) Spatial Relation Agent Prompt You are a Grouping and Symmetry Specialist. Identify higher-order structures from objects and relations. Relations: {relation_list} Your focus:
-
[22]
Grouping: - Clusters of objects (rows, columns, triangles, grids) - Repeated patterns
-
[23]
Symmetry: - Vertical, horizontal, rotational - Mirror or reflective symmetry - Translational repetition
-
[24]
Output Format: JSON with two fields: "groups" and "symmetries" Example: {{ "groups": ["Obj1, Obj2, Obj3 form a horizontal row"], "symmetries": ["Vertical symmetry across Obj2"] }} Grouping Symmetry Agent Prompt You are a Transformation Tracker. Analyze changes across multiple frames of an image sequence. Your focus:
-
[25]
Types of Transformations: - Movement (left, right, up, down) - Scaling (larger, smaller) - Rotation (degrees) - Shading or color change - Shape change (square → circle) - Appearance/disappearance
-
[26]
56 Published as a conference paper at ICLR 2026 You are a Symbolic Abstraction Specialist
Output Format: Per-object transformation logs: Obj1: Frame1 → Frame2 (rotated 90° clockwise) Obj2: Frame2 → Frame3 (moved right) Transformation Tracking Prompt Figure 11: Prompts for SPATIOXOLVER(Part 2). 56 Published as a conference paper at ICLR 2026 You are a Symbolic Abstraction Specialist. Consolidate all perception results into a standardized struct...
work page 2026
-
[27]
Create a unified JSON representation containing: - Objects with attributes - Relations as triples - Groups and symmetries - Transformations across frames
-
[28]
Ensure consistency in IDs and formatting
-
[29]
Example Output: {{ "objects": [...], "relations": [...], "groups": [...], "symmetries": [...], "transformations": [...]}} Representation Standardization Agent Prompt You are a Spatial Reasoning Expert. Analyze the image data and select the correct option for the multiple- choice question. Image Analysis Data: - Objects and Attributes: {attributes_json} - ...
work page 2026
-
[30]
Winds from the sides would tend to slosh it sideways rather than directly over the front rim
From the back MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: A gust blowing from the back of the patio (the top of the image, toward the viewer) would push the liquid forward over the near edge and make it spill first. Winds from the sides would tend to slosh it side...
work page 2026
-
[31]
A rectangle narrowing at the bottom From the right side
-
[32]
A circle with fuzzy edges
-
[33]
The shadow would fall on the wall to the right of the window, roughly over the nightstand
A triangle pointing downward MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: You would see the pendant’s dome-shaped profile: a rounded semicircle or oval shadow with a thin vertical line or narrow neck above where the cord hangs. The shadow would fall on the wall to ...
work page 2026
-
[34]
Yes, easily with room to spare
-
[35]
No, the small jug hand will stuck
-
[36]
No, the small jug is too tall
-
[37]
The jug wouldn't pass through the opening to sit completely inside
Yes, but it would be a perfect fit with no room left MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: No — the tall pot's mouth appears too narrow compared with the small jug's widest part and its handle. The jug wouldn't pass through the opening to sit completely insi...
work page 2026
-
[38]
They both have the same volume A vertical cylinder
-
[39]
The microwave has slightly more volume
-
[40]
The oven below the stove has much more volume
-
[41]
The oven has more width, but the microwave has more height, so they are equal MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: The microwave’s cooking cavity is much smaller than the larger oven below the cooktop — it’s only a small fraction of the oven’s internal volu...
work page 2026
-
[42]
The L-shaped sectional plus the two facing chairs create that U configuration
An irregular shape MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: Viewed from above, the seating is arranged in a U-shape (a horseshoe) around the central fire pit. The L-shaped sectional plus the two facing chairs create that U configuration. • Gemini-2.5-Flash: The...
work page 2026
-
[43]
The chair sits directly in front of the desk, preventing easy access to the notebook
Mug with pens and pencil in it MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The leftmost corner notebook is blocked by a wooden chair. The chair sits directly in front of the desk, preventing easy access to the notebook. • Gemini-2.5-Flash: The notebook at the left...
work page 2026
-
[44]
Unable to determine MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Yes. The central aisle is unobstructed from the foreground to the far end, with shelving stacked along both sides and no items blocking the floor. You can walk straight down the hallway without collid...
work page 2026
-
[45]
Bottom and right MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The marked white crosswalks at each corner of the intersection (the painted lines across the road) are the safest to use. They are signalized and where pedestrians are waiting, unlike cutting across mid-...
work page 2026
-
[46]
Watching on the big screen on the left side
-
[47]
Unable to determine MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing)...
work page 2026
-
[48]
Go straight to the first red colored door
-
[49]
To make a turn and go to the second floor
-
[50]
Go through the door at the end of the train
-
[51]
No viable option MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Because he’s standing on the raised walkway right behind the front-row barrier, his eye level is higher than the people immediately in front of him. That elevation (and the clear space by the railing) gi...
work page 2026
-
[52]
The size of wires are too big
-
[53]
The wires need to be move into another pole
-
[54]
Wires can be set there following a layout
-
[55]
No, the bulk size of the wires are too much here for one pole MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: Not safely in its current state — the pole is clearly overloaded and the wires are tangled, so simply re-bundling them on the same pole would exceed mechanica...
work page 2026
-
[56]
It will cover almost all of the wall
-
[57]
The shadow will only take the size of the light frame as it's actual size
-
[58]
There won't be any shadow
-
[59]
Shadow will smaller than the actual light body MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: If one lamp is rotated 90° so the two beams are perpendicular, you'll get two elongated oval patches of light on the wall at right angles to each other. Where the ovals over...
work page 2026
-
[60]
Similar MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Yes - when like-shaped pipes are grouped you can pack them in their most efficient pattern ,which reduces wasted voids and the total footprint. The exact savings depend on orientation and dimensions; in some case...
work page 2026
-
[61]
Those two coils extend above the ash-colored spring
Bottom mid spring MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: The ash-colored thicker spring in the center will be shorter than the thin, taller silver spring just behind it (top-center) and the long, dark-gray spring at the right edge when all are stood upright. ...
work page 2026
-
[62]
Small, large MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: If you move the red bow to the back it will appear smaller and more obscured by the bows in front because of perspective and overlap, and its red color will look less vivid from that angle. Physically nothin...
work page 2026
-
[63]
The slats break the light into parallel bands, producing the striped pattern
Not applicable MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: Sunlight streaming through window blinds (slatted shades) is casting the diagonal striped shadow on the wall. The slats break the light into parallel bands, producing the striped pattern. • Gemini-2.5-Flas...
work page 2026
-
[64]
A broom is attached to the cart and it sits on the pavement in the foreground
Woman MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground. • Gemini-2.5-Flash: The frontmost ob...
work page 2026
-
[65]
It sits slightly to the left and behind the device
Not applicable MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The ball directly behind the white, square-shaped device is blue (a cyan/teal colored ball). It sits slightly to the left and behind the device. • Gemini-2.5-Flash: The ball directly behind the white-color...
work page 2026
-
[66]
A broom is attached to the cart and it sits on the pavement in the foreground
Lamp chimney MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: The frontmost object is a small orange street-cleaning cart with a blue plastic bin mounted on it. A broom is attached to the cart and it sits on the pavement in the foreground. • Gemini-2.5-Flash: The front...
work page 2026
-
[67]
4 MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: Three. The glowing bulb being held clearly shows a filament, and two more filament bulbs are visible inside the round box on the left. • Gemini-2.5-Flash: There are three objects with filaments inside them. One is the ...
work page 2026
-
[68]
West MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: You would be traveling east. The Route 6 shield is labeled "EAST" and the straight-ahead arrow beneath it shows that going straight follows eastbound Route 6. • Gemini-2.5-Flash: If you go straight following route 6...
work page 2026
-
[69]
In other words, it faces away from the wall toward the center of the room
Toward MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 4 Open-ended Answers: • GPT-5-mini: If the portrait is on the left wall and faces directly into the room, it is oriented perpendicular to that wall and pointing inward. In other words, it faces away from the wall toward the center of the room. • Gem...
work page 2026
-
[70]
120 MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: About 180°. The two covers are opened in opposite directions and lie nearly flat, forming an almost straight line. • Gemini-2.5-Flash: The cases of the pocket watch are open, revealing the intricate mechanism inside....
work page 2026
-
[71]
Behind the building and plants
-
[72]
In front of the plants and leads towards the building
-
[73]
The road is beside the building, with no clear direction
-
[74]
The road is stacked above the building and plants MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: The road runs in front of the large white building, curving around a central planted island and seating area. It separates and encircles the entrance and decorative plant...
work page 2026
-
[75]
Stencil MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: 2 Open-ended Answers: • GPT-5-mini: Those flowing, tapered curves are most characteristic of a paintbrush (or broad paint marker) rather than a spray can. The handedness of the brush user — most likely right-handed — would be most evident, since ri...
work page 2026
-
[76]
It sits almost directly in the middle of the composition, slightly left of the exact center
Berries MCQ Answers: • GPT-5-mini: 4 • Gemini-2.5-Flash: 4 • InternVL3.5-72B: 4 • Kimi-VL-A3B-Thinking-2506: 3 Open-ended Answers: • GPT-5-mini: The red bowl of berries (raspberries and blueberries) is closest to the image's centerline. It sits almost directly in the middle of the composition, slightly left of the exact center.. • Gemini-2.5-Flash: The re...
work page 2026
-
[77]
A monument of various statues
-
[78]
It appears to be one of the sculpted figures that are part of the fountain in the foreground
Humans MCQ Answers: • GPT-5-mini: 1 • Gemini-2.5-Flash: 1 • InternVL3.5-72B: 1 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Between the building and the street lamp stands a stone statue of a person. It appears to be one of the sculpted figures that are part of the fountain in the foreground. • Gemini-2.5-Flash: The object located betw...
work page 2026
-
[79]
It’s hanging over the water with its claws visible
Nothing MCQ Answers: • GPT-5-mini: 2 • Gemini-2.5-Flash: 2 • InternVL3.5-72B: 2 • Kimi-VL-A3B-Thinking-2506: B Open-ended Answers: • GPT-5-mini: A crab is in the top left corner of the plant, perched on the rim of the fishbowl. It’s hanging over the water with its claws visible. • Gemini-2.5-Flash: The plant in the image is a type of seaweed or coral, and...
work page 2026
-
[80]
Cheetah MCQ Answers: • GPT-5-mini: 3 • Gemini-2.5-Flash: 3 • InternVL3.5-72B: 3 • Kimi-VL-A3B-Thinking-2506: 1 Open-ended Answers: • GPT-5-mini: Directly behind the cheetah is a white open-sided tour vehicle or tram. Several people are seated inside the tram, watching and taking photos. • Gemini-2.5-Flash: Directly behind the cheetah, there is a white ope...
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.