Recognition: 2 theorem links · Lean Theorem
3D Primitives are a Spatial Language for VLMs
Pith reviewed 2026-05-14 21:29 UTC · model grok-4.3
The pith
Vision-language models gain spatial understanding when they reason through 3D geometric primitives written as executable code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that 3D geometric primitives expressed in executable code serve as a transferable intermediate representation for spatial understanding in VLMs. This is shown by large variation in object-detection accuracy across scene-code languages, consistent gains from routing spatial reasoning through code generation, and further improvements obtained by distilling the model's own primitive reconstructions back into its weights.
What carries the argument
3D geometric primitives (cubes, spheres, cylinders) expressed in executable code, used as an intermediate spatial representation that the model generates and consumes.
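For concreteness, a minimal sketch of what such a scene-code snippet could look like, written in a Three.js style (the format S³-FT parses, per the abstract). The specific objects, colors, sizes, and coordinates below are illustrative assumptions, not examples taken from the paper or its benchmark.

```typescript
// Illustrative only: a primitive-based reconstruction of a simple scene in a
// Three.js style. Object choices, colors, and coordinates are assumptions.
import * as THREE from "three";

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(50, 16 / 9, 0.1, 100);
camera.position.set(0, 2, 6);

// "red cube on the left"
const cube = new THREE.Mesh(
  new THREE.BoxGeometry(1, 1, 1),
  new THREE.MeshStandardMaterial({ color: 0xff0000 })
);
cube.position.set(-1.5, 0.5, 0);
scene.add(cube);

// "blue sphere to the right of the cube"
const sphere = new THREE.Mesh(
  new THREE.SphereGeometry(0.5, 32, 16),
  new THREE.MeshStandardMaterial({ color: 0x0000ff })
);
sphere.position.set(1.0, 0.5, 0);
scene.add(sphere);

// "green cylinder behind both"
const cylinder = new THREE.Mesh(
  new THREE.CylinderGeometry(0.4, 0.4, 1.2, 24),
  new THREE.MeshStandardMaterial({ color: 0x00ff00 })
);
cylinder.position.set(0, 0.6, -2);
scene.add(cylinder);
```

Each mesh carries an explicit class (box, sphere, cylinder) and a 3D position, which is what makes the representation both executable and directly queryable for spatial relations.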
Load-bearing premise
That the measured gains come from the specific choice of primitive shapes rather than from any code-generation step or fine-tuning process in general.
What would settle it
A controlled test in which a model prompted to output primitive code shows no spatial accuracy lift, or in which fine-tuning on non-primitive code produces equal or larger gains on the same benchmarks.
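As a sketch of how that control could be run: the same spatial questions are answered once directly and once routed through primitive scene code, so only the routing differs. The queryVLM helper, prompts, benchmark items, and substring scoring below are hypothetical stand-ins, not the paper's protocol.

```typescript
// Illustrative control: answer the same spatial questions directly vs. routed
// through primitive scene code, so any lift is attributable to the routing
// step. queryVLM and the benchmark items are hypothetical.
type Item = { image: string; question: string; answer: string };

declare function queryVLM(image: string, prompt: string): Promise<string>;

async function accuracy(items: Item[], viaPrimitiveCode: boolean): Promise<number> {
  let correct = 0;
  for (const { image, question, answer } of items) {
    let prompt = question;
    if (viaPrimitiveCode) {
      // Stage 1: ask for a primitive reconstruction (boxes, spheres, cylinders).
      const sceneCode = await queryVLM(
        image,
        "Reconstruct this scene as Three.js code using only boxes, spheres, and cylinders."
      );
      // Stage 2: answer the question conditioned on the generated scene code.
      prompt = `Scene code:\n${sceneCode}\n\nQuestion: ${question}`;
    }
    const reply = await queryVLM(image, prompt);
    if (reply.trim().toLowerCase().includes(answer.toLowerCase())) correct++;
  }
  return correct / items.length;
}
```

If accuracy(items, true) fails to beat accuracy(items, false), the primitive-code routing carries no lift; a matching variant that routes through non-primitive code (e.g. raw coordinate lists) would isolate whether the primitive shapes themselves matter.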
read the original abstract
Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that 3D geometric primitives (cubes, spheres, cylinders) expressed in executable code function as a powerful intermediate representation for spatial understanding in VLMs. It introduces the SpatialBabel benchmark to evaluate 14 VLMs on primitive-based 3D scene reconstruction across six scene-code languages, proposes Code-CoT (a training-free inference method routing spatial reasoning through primitive code generation), and S³-FT (self-supervised fine-tuning that distills from the model's own Three.js primitive reconstructions with no human labels), reporting consistent gains such as +6.4% on SpatialBabel-QA, +5.0% on CV-Bench-3D for Code-CoT, and up to +17% on HallusionBench for S³-FT.
Significance. If the attribution to the specific primitive representation holds, the work provides a practical, label-free method to improve VLMs' spatial reasoning via an executable geometric vocabulary. The self-supervised distillation in S³-FT and the cross-model transfer are notable strengths, as is the demonstration that VLMs can produce accurate primitive reconstructions yet fail on direct spatial queries. The benchmark's language-sensitivity result offers a useful diagnostic tool for VLM coding capabilities.
major comments (3)
- [§4] §4 (Code-CoT): The lifts (+6.4% SpatialBabel-QA, +5.0% CV-Bench-3D) are attributed to routing through 3D primitive code, yet the manuscript provides no ablation comparing primitive code to other executable formats (e.g., direct Python coordinate lists or non-primitive scene graphs). This control is required to isolate whether the primitive structure itself, rather than generic code generation, produces the gains.
- [§5] §5 (S³-FT): The reported improvements (+4.6–8.6% on SpatialBabel-Primitive-QA, +9.7% CV-Bench-2D, +17% HallusionBench) rely on distilling from the model's own primitive reconstructions, but without a control fine-tuning on alternative structured outputs or generic self-supervision, the specific contribution of the 3D primitive vocabulary versus any structured self-supervision remains unseparated.
- [§3] §3 (SpatialBabel benchmark): While F1 varies up to 5.7× across languages, the manuscript does not report error bars, statistical significance tests, or full details on data splits and exclusion criteria for the held-out benchmarks; these are needed to confirm that the cross-language and cross-method claims are robust.
minor comments (2)
- The abstract states that artifacts will be released upon publication; the camera-ready version should include a concrete link or repository reference.
- [§5] Notation for the six scene-code languages and the exact Three.js parsing procedure in S³-FT could be clarified with a small example table or pseudocode for reproducibility (an illustrative parsing sketch follows below).
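As an illustration of the pseudocode that comment requests, here is a minimal sketch of parsing a Three.js primitive reconstruction into structured annotations. The regex, the Annotation schema, and the assumption that class plus position suffice are all illustrative; the actual S³-FT parsing procedure is not specified in the material above.

```typescript
// Illustrative only: extract (class, position) annotations from generated
// Three.js scene code with a regex. The real S3-FT parser is not specified
// in the reviewed material; the pattern and schema below are assumptions.
type Annotation = { cls: "box" | "sphere" | "cylinder"; position: [number, number, number] };

function parsePrimitives(code: string): Annotation[] {
  const annotations: Annotation[] = [];
  // Match a geometry constructor followed (lazily) by the next position.set(...) call.
  const pattern =
    /new\s+THREE\.(Box|Sphere|Cylinder)Geometry[\s\S]*?position\.set\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(code)) !== null) {
    annotations.push({
      cls: m[1].toLowerCase() as Annotation["cls"],
      position: [Number(m[2]), Number(m[3]), Number(m[4])],
    });
  }
  return annotations;
}
```

Annotations of this shape (object class plus 3D position) can then be turned into question-answer pairs about relative position, counting, or depth order, which is the kind of structured supervision S³-FT is described as fine-tuning on.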
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical support for our claims.
read point-by-point responses
-
Referee: §4 (Code-CoT): The lifts (+6.4% SpatialBabel-QA, +5.0% CV-Bench-3D) are attributed to routing through 3D primitive code, yet the manuscript provides no ablation comparing primitive code to other executable formats (e.g., direct Python coordinate lists or non-primitive scene graphs). This control is required to isolate whether the primitive structure itself, rather than generic code generation, produces the gains.
Authors: We agree that direct ablations against alternative executable formats are needed to isolate the role of the primitive structure. In the revised manuscript we will add experiments using direct Python coordinate lists and non-primitive scene graphs as controls. These will be reported alongside the existing language-sensitivity results to clarify whether gains arise specifically from the 3D geometric primitive vocabulary.
revision: yes
-
Referee: §5 (S³-FT): The reported improvements (+4.6–8.6% on SpatialBabel-Primitive-QA, +9.7% CV-Bench-2D, +17% HallusionBench) rely on distilling from the model's own primitive reconstructions, but without a control fine-tuning on alternative structured outputs or generic self-supervision, the specific contribution of the 3D primitive vocabulary versus any structured self-supervision remains unseparated.
Authors: We acknowledge that controls for alternative structured outputs would better isolate the primitive vocabulary's contribution. We will add such experiments in revision, including fine-tuning on generic self-supervision signals. The label-free nature of S³-FT and its cross-model transfer remain core strengths, but the requested controls will be included to strengthen attribution.
revision: yes
-
Referee: §3 (SpatialBabel benchmark): While F1 varies up to 5.7× across languages, the manuscript does not report error bars, statistical significance tests, or full details on data splits and exclusion criteria for the held-out benchmarks; these are needed to confirm that the cross-language and cross-method claims are robust.
Authors: We will add error bars and statistical significance tests for the F1 scores and other metrics in the revised manuscript. Full details on data splits, exclusion criteria, and benchmark construction will be provided in the main text or appendix. All artifacts will be released to support reproducibility of the robustness claims.
revision: yes
Circularity Check
No circularity: claims rest on independent benchmarks and external evaluations
full rationale
The paper's derivation introduces SpatialBabel as a new benchmark, Code-CoT as a prompting route through code generation, and S³-FT as self-supervised fine-tuning on the model's own Three.js outputs. All reported gains (+6.4% SpatialBabel-QA, +5.0% CV-Bench-3D, +4.6–17% on HallusionBench) are measured on held-out, externally defined benchmarks that are not constructed from the training data or fitted parameters. No equations, self-definitions, or load-bearing self-citations appear; the central claim that primitives function as an intermediate representation is supported by empirical lifts rather than by construction from inputs. The self-supervised loop in S³-FT creates training labels from model outputs but does not make the evaluation results tautological, satisfying the criteria for a self-contained derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: VLMs possess sufficient coding capability to produce executable 3D primitive scenes from images.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (D=3 from circle linking) echoes "3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding".
Reference graph
Works this paper leans on
- [1] Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna, et al. The spatial blindspot of vision-language models. arXiv preprint arXiv:2601.09954, 2026.
- [2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- [4] Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, and Manling Li. Why is spatial reasoning hard for VLMs? An attention mechanism perspective on focus areas. arXiv preprint arXiv:2503.01773, 2025.
- [5] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.
- [6] Raphi Kang, Hongqiao Chen, Georgia Gkioxari, and Pietro Perona. Linear mechanisms for spatiotemporal reasoning in vision language models. arXiv preprint arXiv:2601.12626, 2026.
- [7] Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531, 2025.
- [8] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- [9] Jianing Qi, Jiawei Liu, Hao Tang, and Zhigang Zhu. Beyond semantics: Rediscovering spatial awareness in vision-language models. arXiv preprint arXiv:2503.17349, 2025.
- [10] Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2Code: Benchmarking multimodal code generation for automated front-end engineering. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3956–3974, 2025.
- [11] Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025.
- [12] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023.
- [13] Appendix A, Models and prompting setup (excerpt): fourteen VLMs are evaluated, eight proprietary APIs (Anthropic Claude Opus 4.7 / Sonnet 4.6, OpenAI GPT-5 / GPT-4o, Google Gemini-2.5-Pro/Flash and Gemini-3-Pro/Flash preview [DeepMind, 2025]) and six open-weights bases: Qwen3-VL-8B-Instruct (the S³-FT headline base), Qwen2.5-VL-7B-Instruct (second S³-FT base) [Bai et al., ...].
- [14] Table 6 (excerpt): SpatialBabel-Hypersim-QA-Score (%) under four inference modes (direct, nl_cot, best_cc, worst_cc). best_cc / worst_cc use each model's primitive-best / primitive-worst scene-code language from Table 1; Δbest-direct = best_cc − direct (pp); bold = row max. The recoverable first row reads Claude Opus 4.7: direct 30.6, nl_cot 28.7, best_cc 25.5, worst_cc 26.8, Δ = −5.1; the remaining rows are truncated (Claude Sonnet 4.6: direct 21.7, nl_cot 29...).
- [15] Table 12 (excerpt): single-view S³-FT on Qwen3-VL-8B across multiple independent random seeds lifts SpatialBabel-QA by ~+7% averaged across modes; all three runs converge to similar validation losses (0.215 to 0.243). direct, nl_cot, and best_code_cot (threejs) are 3-seed runs (seeds 0, 42, 1337); worst_code_cot (canonical_json) is a 2-seed run (the seed-0 checkpoint was trained with an older configuration not directly comparable on the ...).
- [16] Training setup (excerpt): pass rates Qwen3-VL-8B 82%, Qwen2.5-VL-7B 84%. Epochs: 3 for v1 (deprecated) and single-view S³-FT, 2 for cross-viewpoint (larger data). Sequence length: 2048 tokens max. Mixed precision: bfloat16. Checkpoint selection: best validation loss over epochs; final model saved additionally. Section L.2, Phase-1 quality filters: generated code is accepted only if it contains all three tokens {THREE.Scene, THREE.PerspectiveC...}.
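Entry [16] mentions a Phase-1 acceptance filter that keeps a generated reconstruction only if it contains a small set of required Three.js tokens; the third token is truncated in the extracted snippet. A minimal sketch of such a token check is below, with the required-token list treated as configuration; it is an assumption about the shape of the filter, not the paper's implementation.

```typescript
// Illustrative Phase-1 quality filter: accept generated scene code only if it
// mentions every required Three.js token. Two tokens are named in the snippet
// above; the third is truncated there, so the list is left configurable.
const REQUIRED_TOKENS = [
  "THREE.Scene",
  "THREE.PerspectiveCamera",
  // third required token truncated in the source snippet
];

function passesQualityFilter(generatedCode: string, required: string[] = REQUIRED_TOKENS): boolean {
  return required.every((token) => generatedCode.includes(token));
}
```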