arxiv: 2606.11918 · v2 · pith:T2CDQXRPnew · submitted 2026-06-10 · 💻 cs.AI

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Pith reviewed 2026-06-27 09:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords spatial reasoning · consistency training · self-supervised RL · large reasoning models · geometric transformations · optimal transport · label-free learning

The pith

Pre-trained models reach near-supervised spatial reasoning accuracy by enforcing consistency under geometric transformations without labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that spatial reasoning skills already exist inside pre-trained large reasoning models and need only alignment through logical checks rather than new labeled data. It introduces a self-supervised reinforcement learning setup in which reward functions verify geometric and semantic consistency after image flips or text swaps. A tailored optimal transport variant of group relative policy optimization drives the training. According to the abstract, this label-free process approaches the accuracy of supervised fine-tuning and generalizes similarly across multiple tasks and domains.

Core claim

The paper claims that formalizing consistency verifiers as reward functions for geometric and semantic coherence under transformations, combined with an optimal transport-based RL strategy called OT-GRPO, allows self-supervised training to improve spatial reasoning in large reasoning models to levels approaching those achieved by supervised fine-tuning on ground-truth data, while preserving similar generalization.

What carries the argument

Consistency verifiers that reward logical coherence under 2D and 3D geometric constraints, implemented through the OT-GRPO reinforcement learning variant.
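
The figure captions further down describe the verifier as a parity rule: each applied transformation either negates the expected answer or leaves it unchanged, and the two answers should match exactly when the number of negations is even. A minimal sketch of that rule follows, assuming binary True/False answers and a 0/1 reward; the function names and the reward scale are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable


def answers_should_match(negating: Iterable[bool]) -> bool:
    """True when an even number of the applied transformations negate the answer."""
    return sum(bool(n) for n in negating) % 2 == 0


def consistency_reward(answer: bool, augmented_answer: bool,
                       negating: Iterable[bool]) -> float:
    """1.0 if the two answers relate as the transformations require, else 0.0.

    Assumption: a hard 0/1 reward; the paper may use a different scale.
    """
    agree = answer == augmented_answer
    return 1.0 if agree == answers_should_match(negating) else 0.0


# Orientation (Figure 12): the flip and the "left of" -> "right of" swap both
# negate the answer, so the two negations cancel and the answers should match.
orientation = [True, True]
# Relative distance (Figure 14): the flip leaves distances unchanged, so only
# the "closer to" -> "further from" swap negates and the answers should differ.
relative_distance = [False, True]

print(consistency_reward(False, False, orientation))      # 1.0
print(consistency_reward(True, True, relative_distance))  # 0.0
```

No ground-truth label appears anywhere in the reward, which is the sense in which the training signal is label-free.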

If this is right

  • Label-free consistency training approaches the accuracy of models trained with ground-truth supervision.
  • The method achieves comparable generalization across diverse tasks and data domains.
  • Training targets the internal reasoning process directly without external annotations.
  • Both image transformations such as flipping and textual transformations such as object swaps serve as effective consistency signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The work suggests that apparent deficits in model spatial abilities may often be alignment problems rather than missing knowledge.
  • Similar consistency-based alignment could apply to other reasoning areas where transformations yield verifiable logical constraints.
  • Adopting this approach might reduce dependence on external synthetic data generators for spatial tasks.

Load-bearing premise

Spatial reasoning capabilities are already present in pre-trained large reasoning models and can be aligned through logical coherence under geometric constraints.

What would settle it

If models trained with consistency verifiers show no accuracy gain over base models or fall measurably short of supervised baselines on held-out spatial reasoning tasks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2606.11918 by Federico Tombari, Leonidas Guibas, Maks Ovsjanikov, Marta Tintore Gazulla, Theo Uscidda.

Figure 1
Figure 1: Example of consistency verifier. Given a prompt asking whether object A is left of object B, we apply transformations (horizontal flip on the image, reformulation on the question) to create an augmented prompt. The consistency verifier checks whether the model’s answers satisfy the expected relationship (here, disagreement) without requiring ground-truth labels. For instance, if the model answers True on…
Figure 2
Figure 2: Same-task evaluation accuracy on SUN RGB-D (indoor) data for two model sizes (3B and 7B) and four tasks (Depth, Orientation, Size and Relative Distance). We separately train models using either an accuracy verifier (requires ground-truth labels) or a consistency verifier (no labels needed), then evaluate on held-out test samples. Despite never seeing any ground-truth labels during training, consistency-tra…
Figure 3
Figure 3: Cross-task transfer on SUN RGB-D (7B model). Each cell shows the improvement over the pre-training baseline (in percentage points) when training on the row task and evaluating on the column task. Off-diagonal cells (colored) show cross-task transfer; diagonal cells (gray) show same-task performance for reference. The rightmost panel shows the average off-diagonal improvement: consistency training nearly ma…
Figure 4
Figure 4: Cross-domain transfer from SUN RGB-D to KITTI (7B model). Models are trained on indoor scenes (SUN RGB-D) and evaluated on outdoor driving scenes (KITTI). Diagonal cells (gray) show same-task cross-domain transfer; off-diagonal cells (colored) combine both task and domain shifts. Despite the visual gap between indoor and outdoor environments, both training methods generalize well. Consistency training near…
Figure 7
Figure 7: Extension to numeric tasks. Numeric accuracy verifier verif(y, y⋆) = max(0, 1 − |y − y⋆| / y⋆), reported as a percentage, on counting (integer object counts) and absolute distance estimation (meters). Consistency closely tracks accuracy on both tasks, trailing by 0.5pp on counting and 2.3pp on absolute distance. consistency overtakes accuracy outright, with gaps of 1.2pp (Acc. 82.6%), 4.2pp (79.6%), and 7.…
Figure 6
Figure 6: Robustness to label corruption. “Acc. +N%” flips N% of training labels uniformly at random in accuracy training; consistency training uses no labels at all (orange dashed line). Accuracy still edges out consistency at 10% corruption, but the label-free signal overtakes it from 20% onward. 5.4. Robustness to Label Corruption. Consistency overtakes accuracy from 20% noise. Reusing the all-tasks protocol abov…
Figure 8
Figure 8: Ablation on pairing strategy for consistency training (7B model, SUN RGB-D). Each point compares minimal pairing (y-axis) against an alternative strategy (x-axis): random (purple circles) or one-to-all (cyan triangles). Points above the diagonal indicate minimal pairing outperforms the alternative. Left: Self-task accuracy: each point is one of the four tasks, showing that minimal pairing achieves higher ac…
Figure 9
Figure 9: Expected per-completion reward under the random baseline as a function of group size K. Random pairing and one-to-all remain at 1/2 regardless of K, while minimal consistency decays as 1/√(πK). Interpretation. Only minimal consistency penalizes uninformative models: as K grows, the expected reward for a random guesser vanishes as O(1/√K). The key insight is that minimal consistency actively searches for …
Figure 10
Figure 10 shows the model’s reasoning before and after consistency training on a depth comparison task from SUN RGB-D. Before training, the model relies on a flawed heuristic (vertical position in the image) and produces an incorrect answer. After training, the model correctly reasons about 3D spatial relationships and arrives at the correct answer. <think> I need to determine which object is closer to the camera b…
Figure 11
Figure 11: Depth task example. The augmented prompt applies color jitter and relation swap (“closer to” → “further from”), but no horizontal flip. Since relation swap is a single equivariant transformation (one negation), the answers should differ. - object 1 = "table", marked with a red dot. - object 2 = "chair", marked with a blue dot. Is object 1 to the left of object 2? Answer: False - object 1 = "table", marked…
Figure 12
Figure 12: Orientation task example. The augmented prompt applies horizontal flip, color jitter, and relation swap (“left of” → “right of”). The flip negates the spatial relationship, and the relation swap negates the question: two equivariant transformations that cancel out, so the answers should match.
Figure 13
Figure 13: Size task example. The augmented prompt applies an aggressive crop (50–60% scale), color jitter, and a different question template, but no relation swap. All transformations are invariant (zero negations), so the answers should match. - object 1 = "chair", highlighted by a red box. - object 2 = "table", highlighted by a blue box. - object 3 = "person", highlighted by a green box. Is object 2 closer to obje…
Figure 14
Figure 14: Relative distance task example (triplet). The augmented prompt applies horizontal flip, color jitter, and relation swap (“closer to” → “further from”). The flip does not affect inter-object distances, so only the relation swap contributes a negation, and the answers should differ.
Figure 15
Figure 15: KITTI depth task example. The augmented prompt applies color jitter and relation swap (“closer to” → “further from”), but no horizontal flip. Since relation swap is a single equivariant transformation (one negation), the answers should differ. - object 1 = "car", marked with a red dot. - object 2 = "car", marked with a blue dot. Is object 1 to the left of object 2? Answer: False - object 1 = "car", marked…
Figure 16
Figure 16: KITTI orientation task example. The augmented prompt applies horizontal flip, color jitter, and relation swap (“left of” → “right of”). The flip negates the spatial relationship, and the relation swap negates the question: two equivariant transformations that cancel out, so the answers should match. - object 1 = "car", marked with a red dot. - object 2 = "cyclist", marked with a blue dot. Is object 1 bigge…
Figure 17
Figure 17: KITTI size task example. The augmented prompt applies a bounding-box-aware crop, color jitter, and a different question template, but no relation swap. All transformations are invariant (zero negations), so the answers should match.
Figure 18
Figure 18: KITTI relative distance task example (triplet). The augmented prompt applies horizontal flip, color jitter, and relation swap (“closer to” → “further from”). The flip does not affect inter-object distances, so only the relation swap contributes a negation, and the answers should differ. How many chairs are visible in the image? Answer: 3 In the image, how many chairs can you see? Answer: 3
Figure 19
Figure 19: Counting task example. The augmented prompt applies an object-preserving crop, color jitter, and template resampling. All transformations are invariant for counting (the number of objects of a given class is unchanged), so the answers should match.
Figure 20
Figure 20: Absolute distance task example. The augmented prompt applies horizontal flip, color jitter, and template resampling. Mirroring and visual jitter leave the 3D distance between two objects unchanged and the question paraphrasing keeps its meaning, so the answers should match.
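
The Figure 7 caption gives the numeric accuracy verifier as a clipped relative error. The formula can be written out directly; this is a sketch assuming a strictly positive target y⋆ (the caption does not say how a zero target is handled), and the function name is illustrative.

```python
def numeric_accuracy(y: float, y_star: float) -> float:
    """verif(y, y*) = max(0, 1 - |y - y*| / y*), as stated in the Figure 7 caption.

    Assumption: y_star > 0, so the relative error is well defined.
    """
    if y_star <= 0:
        raise ValueError("y_star must be positive for the relative error to be defined")
    return max(0.0, 1.0 - abs(y - y_star) / y_star)


print(numeric_accuracy(3, 3))  # exact count -> 1.0
print(numeric_accuracy(2, 4))  # off by half the target -> 0.5
print(numeric_accuracy(9, 4))  # error larger than the target -> 0.0
```

The clipping at zero means that any prediction at least twice the target, or at most zero, scores the same as any other badly wrong answer.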
Original abstract

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
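
For readers unfamiliar with group relative policy optimization, the core idea is that each sampled completion is scored against the other completions drawn for the same prompt. The sketch below shows that standard group-relative normalization only; the `eps` constant and the use of the population standard deviation are illustrative assumptions, and the abstract does not specify how OT-GRPO derives per-completion rewards from matched pairs.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Center and scale rewards within one group of completions.

    Completions above the group mean get a positive advantage, those below a
    negative one. `eps` (an assumed constant) avoids division by zero when
    every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, three rewarded and one not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
print(advantages)

# A group with identical rewards carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```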

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript argues that spatial reasoning capabilities are already latent in pre-trained Large Reasoning Models (LRMs) and can be aligned via a self-supervised RL framework. It introduces consistency verifiers (checking geometric/semantic coherence under transformations such as image flips and object swaps) and proposes OT-GRPO, a minimal-matching optimal-transport variant of group relative policy optimization. The central empirical claim is that this label-free approach approaches the accuracy of ground-truth supervised training while achieving comparable generalization across tasks and domains.

Significance. If the results and mechanism hold, the work would be significant for demonstrating a scalable, annotation-free route to improving spatial reasoning in LRMs and for formalizing consistency-based rewards in RL. The self-supervised design and OT-GRPO formulation are technical strengths that could generalize beyond the spatial setting. The paper receives credit for avoiding parameter fitting to the target metric and for focusing on internal reasoning coherence rather than external labels.

major comments (1)
  1. Introduction and §4 (Experiments): The load-bearing claim that 'spatial reasoning capabilities are already present in pre-trained LRMs but require alignment' is not distinguished from the alternative that the consistency verifiers simply supply an indirect reward signal that teaches new behavior. No independent probe (zero-shot performance on geometric subtasks, representation analysis, or pre-training consistency checks) is reported to test this. If models produce mutually consistent yet systematically incorrect answers under the chosen transformations, the training loop would reinforce errors rather than factuality, undermining the 'alignment of latent capability' interpretation.
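
The failure mode raised in this comment can be made concrete with a toy case. The sketch assumes a left/right question under a horizontal flip and a hypothetical model that systematically confuses left and right; all names are illustrative.

```python
def flip_consistent(answer_original: bool, answer_flipped: bool) -> bool:
    """A horizontal flip negates a left/right question, so the answers must differ."""
    return answer_original != answer_flipped


truth_original = True               # object A really is left of object B
truth_flipped = not truth_original  # after mirroring, A is right of B

# Hypothetical model that always swaps left and right.
wrong_original = not truth_original
wrong_flipped = not truth_flipped

consistency = 1.0 if flip_consistent(wrong_original, wrong_flipped) else 0.0
accuracy = 1.0 if wrong_original == truth_original else 0.0
print(consistency, accuracy)  # 1.0 0.0
```

The model earns the full consistency reward while being wrong on both prompts, which is why the referee asks for evidence that such consistent-but-incorrect solutions are not what training selects.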
minor comments (2)
  1. The abstract states performance claims without quantitative results, error bars, or experimental details; these must appear in the main text with clear baselines and statistical reporting to allow verification of the 'approaches supervised accuracy' statement.
  2. §3 (Method): Provide the precise mathematical definition of the consistency verifiers and the OT-GRPO objective (including how the optimal transport matching is computed) so that the 'minimal-matching' property can be reproduced.
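
Until the requested definition is available, one reading of "minimal matching" can be sketched as follows: pair the K completions of the original prompt one-to-one with the K completions of the augmented prompt so that the total consistency reward is as small as possible. This reading is an assumption, not the paper's stated objective; it is offered because, for fair-coin answers, it reproduces the 1/√(πK) decay quoted in the Figure 9 caption. The brute-force search is for illustration only, and a linear assignment solver would be the natural replacement for larger groups.

```python
from itertools import permutations, product
from math import comb, pi, sqrt


def pair_reward(a: bool, b: bool, should_match: bool) -> float:
    """0/1 consistency reward for one (original, augmented) answer pair."""
    return 1.0 if (a == b) == should_match else 0.0


def minimal_matching(originals, augmented, should_match):
    """Brute-force the one-to-one pairing with the lowest total reward (small K only)."""
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(len(originals))):
        total = sum(pair_reward(originals[i], augmented[j], should_match)
                    for i, j in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


def enumerated_random_reward(k: int, should_match: bool = True) -> float:
    """Average per-completion reward over all 2^(2k) binary answer patterns."""
    total = 0.0
    for bits in product([False, True], repeat=2 * k):
        _, best = minimal_matching(bits[:k], bits[k:], should_match)
        total += best / k
    return total / 2 ** (2 * k)


def closed_form_random_reward(k: int) -> float:
    """Same expectation in closed form: C(2k, k) / 4^k (derived under the assumption above)."""
    return comb(2 * k, k) / 4 ** k


_, total = minimal_matching([True, True, False, True],
                            [True, False, False, True], should_match=True)
print(total)  # 1.0: one agreeing pair cannot be avoided

print(enumerated_random_reward(3), closed_form_random_reward(3))
print(closed_form_random_reward(64), 1 / sqrt(pi * 64))
```

Under this reading a random guesser's expected reward shrinks as the group grows, whereas a pairing chosen at random would leave it at 1/2, matching the contrast described in the Figure 9 caption.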

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for this substantive comment on the central interpretive claim. We address the concern directly below and indicate where revisions will be made to strengthen the distinction between latent alignment and indirect teaching.

Point-by-point responses
  1. Referee: Introduction and §4 (Experiments): The load-bearing claim that 'spatial reasoning capabilities are already present in pre-trained LRMs but require alignment' is not distinguished from the alternative that the consistency verifiers simply supply an indirect reward signal that teaches new behavior. No independent probe (zero-shot performance on geometric subtasks, representation analysis, or pre-training consistency checks) is reported to test this. If models produce mutually consistent yet systematically incorrect answers under the chosen transformations, the training loop would reinforce errors rather than factuality, undermining the 'alignment of latent capability' interpretation.

    Authors: We agree that the manuscript does not report independent probes (zero-shot geometric subtasks, representation analysis, or pre-training consistency statistics) that would directly test the latent-capability hypothesis against the alternative that the verifiers supply a new reward signal. The current evidence is indirect: the label-free OT-GRPO procedure reaches accuracy levels close to those of ground-truth supervised training on held-out tasks and domains. If the verifiers were systematically reinforcing consistent-but-incorrect answers, we would not expect this convergence toward supervised performance; the fact that it occurs suggests the consistency signal is selecting for factuality rather than arbitrary consistent errors. Nevertheless, we acknowledge that this remains an inference rather than a direct test. We will revise the Introduction and add a dedicated paragraph in §4 (and the Discussion) that (a) explicitly states the alternative interpretation, (b) notes the absence of the suggested probes as a limitation, and (c) argues that the cross-task and cross-domain generalization results are more consistent with alignment than with de-novo learning. No new experiments will be added at this stage.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained.

full rationale

The paper presents a self-supervised RL framework that applies external geometric and textual transformations (image flips, object swaps) as consistency verifiers to improve spatial reasoning in pre-trained LRMs, with OT-GRPO as the optimization strategy. The claimed improvement is demonstrated empirically by comparing label-free training outcomes to ground-truth supervised baselines across tasks and domains, without any equation or step that reduces the reported accuracy gains to a fitted input, self-defined metric, or self-citation chain. The central premise that latent capabilities exist and require alignment is treated as a hypothesis tested via the method rather than presupposed by construction. No self-definitional, fitted-prediction, or uniqueness-imported patterns appear in the abstract or described approach.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that latent spatial capabilities exist and can be surfaced by consistency under transformations, plus two newly introduced algorithmic components whose independent evidence is not supplied in the abstract.

axioms (1)
  • domain assumption: spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints
    Explicitly stated as the core argument contrasting with knowledge-deficit views.
invented entities (2)
  • consistency verifiers (no independent evidence)
    purpose: Reward functions that check for geometric and semantic consistency under transformations
    Introduced as the central training signal; no external validation provided in abstract.
  • OT-GRPO (no independent evidence)
    purpose: Optimal transport-based minimal-matching variant of group relative policy optimization for pairwise verifiers
    New RL strategy proposed in the work; no prior reference or external validation in abstract.

pith-pipeline@v0.9.1-grok · 5754 in / 1363 out tokens · 19706 ms · 2026-06-27T09:44:29.973086+00:00 · methodology

