pith. sign in

arxiv: 2606.03577 · v1 · pith:YY66TOOCnew · submitted 2026-06-02 · 💻 cs.CV

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Pith reviewed 2026-06-28 10:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords wide-baseline matchingspatial reasoningmultimodal large language modelsreinforcement learningReasonMatch-BenchDCRLviewpoint progressioncorrespondence curriculum
0
0 comments X

The pith

Reinforcement learning on automatically extracted wide-baseline view pairs trains MLLMs to perform complex spatial matching without chain-of-thought labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models currently fail at matching fine details across large viewpoint shifts, a core requirement for spatial reasoning in physical settings. The paper builds ReasonMatch-Bench to quantify this gap across indoor, outdoor, and object scenes and constructs a pipeline that pulls verifiable wide-baseline pairs from existing video and 3D reconstruction datasets. It then introduces Dynamic Correspondence Reinforcement Learning, which applies progressive curricula first at the full-image viewpoint level and then at the individual point correspondence level to supply reward signals. If the approach holds, models gain measurable spatial competence on the new benchmark and related tasks while retaining performance on standard visual understanding tests.

Core claim

ReasonMatch-Bench reveals that existing MLLMs reach only 37.2 F1 on a hard wide-baseline subset where humans reach 84.0. A scalable extraction pipeline yields diverse, verifiable training pairs from RGB-D videos and SfM reconstructions. Dynamic Correspondence Reinforcement Learning combines Image-Level Viewpoint Progression with Point-Level Correspondence Curriculum to deliver training through verifiable rewards, producing substantial gains on ReasonMatch-Bench, positive transfer to other spatial benchmarks, and modest gains or stability on general visual understanding tasks.

What carries the argument

Dynamic Correspondence Reinforcement Learning (DCRL), which supplies reinforcement signals by progressing first across image-level viewpoint changes and then across point-level correspondences.

Load-bearing premise

The automatically extracted wide-baseline view pairs from video-3D corpora supply accurate, verifiable supervision that does not introduce systematic errors or biases into the training signal.

What would settle it

Applying DCRL to the generated pairs produces no improvement on ReasonMatch-Bench, or replacing the extracted pairs with random or noisy matches still yields the same reported gains.

Figures

Figures reproduced from arXiv: 2606.03577 by Anzhou Li, Chunhua Shen, Cong Chen, Duochao Shi, Hao Chen, Hao Zhong, Hua Geng, Muzhi Zhu, Shenyan Zeng, Tao Lin, Wentao Ye.

Figure 1
Figure 1. Figure 1: Wide-Baseline Matching exposes major spatial-reasoning gaps in current MLLMs. Right: even frontier models struggle on ReasonMatch-Bench. Middle: DCRL improves ReasonMatch and transfers to other spatial intelligence benchmarks. Left: Dataset curation pipeline from RGBD datas to different WBM question-answer pairs. duce a scalable data-generation pipeline that automatically harvests wide-baseline view pairs … view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of cross-view region matching. Our method correctly preserves global spatial consistency across viewpoints, [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Training curves of DCRL. Left: reward curve. Right: [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Prompt For Wide-Baseline Matching Task. ing on a single fixed difficulty level. Across spatial benchmarks, DCRL improves over the base model on all four reported tasks (+43.0 on Reason￾Match, +5.3 on SAT, +5.3 on OmniSpatial, and +3.5 on MindCube). By contrast, SFT shows mixed behavior: it im￾proves on ReasonMatch (+23.5) and MindCube (+5.1), is roughly flat on OmniSpatial (−1.0), and degrades substan￾tial… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of five models across key error dimen [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative example on ReasonMatch-Bench (Example 1). The model is given two views of the same scene and asked to match [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative example on ReasonMatch-Bench (Example 2). Another successful case where the model aligns regions across wide [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
read the original abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching (WBM) in MLLMs stratified by viewpoint displacement and granularity, shows that existing models perform poorly (best baseline 37.2 F1 vs. human 84.0 on a 90-sample subset), describes an automatic pipeline extracting view pairs from RGB-D and SfM video-3D corpora, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) combining viewpoint progression and point-level curriculum to train via verifiable rewards, claiming substantial gains on ReasonMatch-Bench, positive transfer to related spatial tasks, and maintained general visual performance.

Significance. If the auto-extracted supervision is shown to be accurate, the scalable pipeline and DCRL approach could offer a practical route to elicit geometric and occlusion reasoning in MLLMs without explicit chain-of-thought, with potential downstream value for embodied AI and 3D perception tasks.

major comments (2)
  1. [Data Generation Pipeline] Abstract and Data Generation Pipeline: the assertion of 'diverse and verifiable supervision' from the RGB-D/SfM extraction pipeline is not accompanied by quantitative validation (e.g., reprojection error statistics, dynamic-object failure rates, or inter-annotator agreement on a held-out subset). This is load-bearing for the central claim that DCRL gains reflect improved spatial reasoning rather than adaptation to pipeline artifacts.
  2. [Experiments] Experiments section: reported improvements on ReasonMatch-Bench and transfer tasks lack error bars, exact baseline implementations, and ablations isolating the Image-Level Viewpoint Progression versus Point-Level Correspondence Curriculum components, weakening confidence that the observed deltas are robust.
minor comments (2)
  1. [Abstract] Abstract: the 90-sample difficult subset should state selection criteria and whether results generalize to the full benchmark stratification.
  2. [DCRL Method] Notation: 'verifiable rewards' is used without a precise definition of the reward function or how correspondence correctness is scored at training time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [Data Generation Pipeline] Abstract and Data Generation Pipeline: the assertion of 'diverse and verifiable supervision' from the RGB-D/SfM extraction pipeline is not accompanied by quantitative validation (e.g., reprojection error statistics, dynamic-object failure rates, or inter-annotator agreement on a held-out subset). This is load-bearing for the central claim that DCRL gains reflect improved spatial reasoning rather than adaptation to pipeline artifacts.

    Authors: We agree that the current manuscript lacks explicit quantitative validation metrics for the pipeline. While the extraction relies on established SfM and RGB-D reconstruction techniques, we will add a dedicated subsection reporting reprojection error statistics, dynamic-object failure rates, and agreement metrics on a held-out validation subset to directly support the verifiability of the supervision signals. revision: yes

  2. Referee: [Experiments] Experiments section: reported improvements on ReasonMatch-Bench and transfer tasks lack error bars, exact baseline implementations, and ablations isolating the Image-Level Viewpoint Progression versus Point-Level Correspondence Curriculum components, weakening confidence that the observed deltas are robust.

    Authors: We concur that these elements are necessary for assessing robustness. The revised version will report error bars computed over multiple random seeds, provide exact implementation details and hyperparameters for all baselines, and include new ablation experiments that separately evaluate the Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum components. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains are measured independently of training pipeline definitions

full rationale

The paper's central claims consist of empirical performance improvements on the newly introduced ReasonMatch-Bench (stratified by viewpoint and granularity) plus transfer to external spatial benchmarks, obtained after training DCRL on automatically extracted wide-baseline pairs. These measured F1 scores and benchmark deltas are not shown to reduce by construction to quantities defined inside the data-generation pipeline, the verifiable-reward formulation, or any self-citation chain. No equations, fitted parameters, or uniqueness theorems are presented that equate the reported predictions to the training inputs; the derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the RL reward definition and data extraction rules are implicit but not detailed enough to audit.

pith-pipeline@v0.9.1-grok · 5786 in / 1111 out tokens · 34234 ms · 2026-06-28T10:45:57.845959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 15 linked inside Pith

  1. [1]

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025. 6, 7, 8

  2. [2]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. 7

  3. [3]

    Hot3d: Hand and object tracking in 3d from egocentric multi-view videos

    Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, et al. Hot3d: Hand and object tracking in 3d from egocentric multi-view videos. InCVPR, pages 7061–7071, 2025. 1

  4. [4]

    Graph-cut ransac

    Daniel Barath and Jiri Matas. Graph-cut ransac. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 6733–6741, 2018. 3

  5. [5]

    Magsac: marginalizing sample consensus

    Daniel Barath, Jiri Matas, and Jana Noskova. Magsac: marginalizing sample consensus. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10197–10205, 2019. 3

  6. [6]

    Surf: Speeded up robust features

    Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. InEuropean conference on com- puter vision, pages 404–417. Springer, 2006. 3

  7. [7]

    Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024

    Aleksei Bochkovskii, Ama ˜AG ¸ l Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024. 1

  8. [8]

    Assignment problems: revised reprint

    Rainer Burkard, Mauro Dell’Amico, and Silvano Martello. Assignment problems: revised reprint. SIAM, 2012. 3

  9. [9]

    Spatialvlm: Endowing vision-language mod- els with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language mod- els with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024. 7

  10. [10]

    Are we on the right way for evaluating large vision-language models?Advances in Neural Informa- tion Processing Systems, 37:27056–27087, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Informa- tion Processing Systems, 37:27056–27087, 2024. 8

  11. [11]

    Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 7

  12. [12]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 3, 1

  13. [13]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 337–348, 2018. 3

  14. [14]

    Are multimodal large language models ready for omnidirectional spatial reasoning?arXiv preprint arXiv:2505.11907, 2025

    Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, and Xum- ing Hu. Are multimodal large language models ready for omnidirectional spatial reasoning?arXiv preprint arXiv:2505.11907, 2025. 2

  15. [15]

    D2-net: A trainable cnn for joint detection and description of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Polle- feys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint detection and description of local features. InCVPR 2019-IEEE Conference on Computer Vi- sion and Pattern Recognition, 2019. 3

  16. [16]

    Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981

    Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Communications of the ACM, 24(6):381–395, 1981. 3

  17. [17]

    A graduated assignment algorithm for graph matching.IEEE Transactions on pattern analysis and machine intelligence, 18(4):377–388, 2002

    Steven Gold and Anand Rangarajan. A graduated assignment algorithm for graph matching.IEEE Transactions on pattern analysis and machine intelligence, 18(4):377–388, 2002. 3

  18. [18]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633– 638, 2025. 2, 3

  19. [19]

    Projective reconstruction and invariants from multiple images.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 16(10):1036–1041, 1994

    Richard I Hartley. Projective reconstruction and invariants from multiple images.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 16(10):1036–1041, 1994. 3

  20. [20]

    Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE TPAMI, 2024

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geomet- ric foundation model for zero-shot metric depth and surface normal estimation.IEEE TPAMI, 2024. 1

  21. [21]

    Notvla: Narrowing of dense action trajecto- ries for generalizable robot manipulation.arXiv preprint arXiv:2510.03895, 2025

    Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, et al. Notvla: Narrowing of dense action trajecto- ries for generalizable robot manipulation.arXiv preprint arXiv:2510.03895, 2025. 1

  22. [22]

    Robobrain: A unified brain model for robotic manipulation from abstract to concrete.arXiv preprint arXiv:2502.21257, 2025

    Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al. Robobrain: A unified brain model for robotic manipulation from abstract to concrete.arXiv preprint arXiv:2502.21257, 2025. 7

  23. [23]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vi- sion language models.arXiv preprint arXiv:2506.03135,

  24. [24]

    Image matching across wide baselines: From paper to practice

    Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, Jiri Matas, Pascal Fua, Kwang Moo Yi, and Eduard Trulls. Image matching across wide baselines: From paper to practice. arXiv preprint arXiv:2003.01587, 2020. 3

  25. [25]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 1

  26. [26]

    Building and better understanding vision- language models: Insights and future directions

    Hugo Laurenc ¸on, Andr´es Marafioti, Victor Sanh, and L ´eo Tronchon. Building and better understanding vision- language models: Insights and future directions. InWork- shop on Responsibly Building the Next Generation of Multi- modal Foundational Models, 2024. 7

  27. [27]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, pages 22160–22169,

  28. [28]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 1, 7

  29. [29]

    Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny

    Xingchen Liu, Piyush Tayal, Jianyuan Wang, Jesus Zarzar, Tom Monnier, Konstantinos Tertikas, Jiali Duan, Antoine Toisoul, Jason Y . Zhang, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov, and David Novotny. Uncommon objects in 3d. InarXiv, 2025. 3, 1

  30. [30]

    Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

  31. [31]

    Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004

    David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004. 3

  32. [32]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, pages 6924–6934, 2025. 2

  33. [33]

    Working hard to know your neighbor’s mar- gins: Local descriptor learning loss

    Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s mar- gins: Local descriptor learning loss. InAdvances in neural information processing systems, pages 4826–4837, 2017. 3

  34. [34]

    Gpt-4 technical report.CoRR, abs/2303.08774,

    OpenAI. Gpt-4 technical report.CoRR, abs/2303.08774,

  35. [35]

    Hello gpt-4o.OpenAI Blog, 2024

    OpenAI. Hello gpt-4o.OpenAI Blog, 2024. Accessed: November 22, 2024. 7

  36. [36]

    Manivideo: Generating hand-object manipulation video with dexterous and generalizable grasping

    Youxin Pang, Ruizhi Shao, Jiajun Zhang, Hanzhang Tu, Yun Liu, Boyao Zhou, Hongwen Zhang, and Yebin Liu. Manivideo: Generating hand-object manipulation video with dexterous and generalizable grasping. InCVPR, pages 12209–12219, 2025. 1

  37. [37]

    Wide baseline stereo matching

    Philip Pritchett and Andrew Zisserman. Wide baseline stereo matching. InICCV, pages 754–760. IEEE, 1998. 1

  38. [38]

    Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kemb- havi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude train- ing for multimodal language models.arXiv preprint arXiv:2412.07755, 2024. 2, 8

  39. [39]

    Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InProceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021. 3, 1

  40. [40]

    R2d2: Repeatable and reliable detector and descrip- tor.arXiv preprint arXiv:1906.06195, 2019

    J ´erˆome Revaud, Philippe Weinzaepfel, C´esar De Souza, No´e Pion, Gabriela Csurka, Yohann Cabon, and Martin Humen- berger. R2d2: Repeatable and reliable detector and descrip- tor.arXiv preprint arXiv:1906.06195, 2019. 3

  41. [41]

    Orb: An efficient alternative to sift or surf

    Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. InInter- national conference on computer vision, pages 2564–2571. IEEE, 2011. 3

  42. [42]

    PhD thesis, INRIA, 1995

    Cordelia Schmid and Roger Mohr.Matching by local invari- ants. PhD thesis, INRIA, 1995. 1

  43. [43]

    Structure- from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InCVPR, pages 4104–4113, 2016. 3, 4, 1

  44. [44]

    Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 15768–15780, 2025. 2

  45. [45]

    Dai, Anja Hauth, Katie Millican, et al

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gem- ini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 7

  46. [46]

    Sosnet: Second order similarity reg- ularization for local descriptor learning

    Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. Sosnet: Second order similarity reg- ularization for local descriptor learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11494–11503, 2019. 3

  47. [47]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 1

  48. [48]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.CoRR, abs/2409.12191, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.CoRR, abs/2409.12191, 2024. 1

  49. [49]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. InCVPR, pages 5261–5271, 2025. 1

  50. [50]

    Site: towards spatial intelligence thorough evaluation.arXiv preprint arXiv:2505.05456,

    Wenqi Wang, Reuben Tan, Pengyue Zhu, Jianwei Yang, Zhengyuan Yang, Lijuan Wang, Andrey Kolobov, Jianfeng Gao, and Boqing Gong. Site: towards spatial intelligence thorough evaluation.arXiv preprint arXiv:2505.05456,

  51. [51]

    Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, et al. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms.arXiv preprint arXiv:2506.14245, 2025. 2

  52. [52]

    V?: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024. 8

  53. [53]

    Realworldqa: Real-world visual question answering benchmark.https://x.ai/news/grok-1.5v, 2024

    xAI. Realworldqa: Real-world visual question answering benchmark.https://x.ai/news/grok-1.5v, 2024. 8

  54. [54]

    Multi-spatialmllm: Multi-frame spatial un- derstanding with multi-modal large language models.arXiv preprint arXiv:2505.17015, 2025

    Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xi- aodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial un- derstanding with multi-modal large language models.arXiv preprint arXiv:2505.17015, 2025. 2

  55. [55]

    Thinking in space: How mul- timodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How mul- timodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025. 2

  56. [56]

    Seeing from another perspective: Evaluating multi-view understanding in mllms

    Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, and Yi Ma. Seeing from another perspective: Evaluating multi-view understanding in mllms. arXiv preprint arXiv:2504.15280, 2025. 1

  57. [57]

    Mindcube: Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025

    Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chan- drasegaran, Han Liu, Ranjay Krishna, Saining Xie, Man- ling Li, Jiajun Wu, and Li Fei-Fei. Mindcube: Spa- tial mental modeling from limited views.arXiv preprint arXiv:2506.21458, 2025. 1, 8

  58. [58]

    Lmms- eval: Reality check on the evaluation of large multimodal models, 2024

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms- eval: Reality check on the evaluation of large multimodal models, 2024. 8

  59. [59]

    Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024. 7

  60. [60]

    Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qing- song Wen, Zhang Zhang, et al. Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024. 8

  61. [61]

    Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256, 2025

    Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration.arXiv preprint arXiv:2505.20256, 2025. 2

  62. [62]

    Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308, 2025

    Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, et al. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics.arXiv preprint arXiv:2506.04308, 2025. 2

  63. [63]

    Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images.ACM Transactions on Graphics (TOG), 37(4):1–12, 2018. 3, 1

  64. [64]

    Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.CoRR, abs/2504.10479, 2025

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shen- glong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.CoRR, abs/2504.10479, 2025. 1, 7

  65. [65]

    Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories

    Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qing- pei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Sega- gent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3686–3696, 2025. 1

  66. [66]

    Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025

    Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, et al. Active-o3: Empowering multimodal large language models with active perception via grpo.arXiv preprint arXiv:2505.21457, 2025. 1 Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching Supplementary Material

  67. [67]

    7 – Implementation Details:Complete specifi- cations of data generation pipeline, experimental setup, prompt template, and curriculum progression schedules

    Appendix Overview This supplementary material provides comprehensive de- tails supporting our main paper, organized as follows: Sec. 7 – Implementation Details:Complete specifi- cations of data generation pipeline, experimental setup, prompt template, and curriculum progression schedules. Sec. 8 – Additional Ablations:Extended analyses on curriculum desig...

  68. [68]

    Implementation Details 7.1. Dataset Generation Correspondence Generation from RGB-D Data.Start- ing from RGB-D videos (or RGB with known intrinsics and camera-to-world poses), we back-project every pixel xwith valid depth in imageI 1 into a 3D pointXin the world coordinate system under a static-scene assumption, and reproject it to the consecutive imageI2...

  69. [69]

    Supervised Fine-tuning

    Ablation Studies Reinforcement Learning vs. Supervised Fine-tuning. To assess our curriculum-based reinforcement learning ap- proach, we compare against supervised fine-tuning (SFT) on the same cross-view matching data. The SFT baseline uses 300 steps of supervised training with teacher-forced matching predictions, while our DCRL method employs re- inforc...

  70. [70]

    None” (F5).ReasonMatch-Bench includes regions that legitimately have no correspondence in the other view. F5 measures whether the model uses the “none

    ReasonMatch-Bench Results In this section, we report detailed results on ReasonMatch- Bench and analyze how different multimodal LLMs be- have under our cross-view region matching task. We first summarize the overall quantitative performance across sce- narios and difficulty levels and then provide a fine-grained comparison along several error dimensions ...

  71. [71]

    1": "1",

    Limitations and Future Work Despite substantial improvements over baselines, our ap- proach reveals both the promise and challenges of devel- oping human-level spatial intelligence in MLLMs. Our best model achieves 52.0% F1 compared to untrained hu- mans’ 84.0%, with the gap most pronounced on object- centric scenes where even humans struggle (27.8% vs. 6...