arxiv: 2606.11683 · v1 · pith:TT6VTEENnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Pith reviewed 2026-06-27 10:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: spatial reasoning · multimodal large language models · novel view synthesis · egocentric video · inference-time reasoning · 3D reconstruction · video understanding

The pith

A two-phase process lets an MLLM form a spatial hypothesis from video then revise it after seeing synthesized complementary views from predicted 3D geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that spatial reasoning in egocentric video is limited by single camera paths and single-turn inference, so models must lean on semantic priors instead of new evidence. It proposes that reasoning should stay revisitable: an initial hypothesis can be checked or corrected once additional viewpoints appear. ReRe implements this with a training-free Reason Phase followed by a Re-reason Phase that feeds the model a novel-view video. The novel views come from a Geometry-to-Video pipeline that turns predicted 3D structure into an elevated, scene-spanning video the model can read natively. On two spatial-reasoning benchmarks, VSI-Bench and STI-Bench, the method is reported to lift open-source multimodal models to levels that rival proprietary state-of-the-art systems.

Core claim

Conclusions drawn from limited video evidence remain open to revision when strategically complementary novel views, rendered from the same predicted 3D geometry, become available; feeding those views back to the same MLLM in a second inference pass produces measurable gains in spatial accuracy without any model retraining or architectural change.

What carries the argument

The ReRe two-phase framework whose Re-reason Phase consumes a Geometry-to-Video output that renders an elevated oblique video spanning the full scene from the initial 3D reconstruction.
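
The two-phase control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `ask_mllm`, `geometry_to_video`, `ReReResult`, and the wording of the revisit prompt are all hypothetical stand-ins, since this summary does not give the paper's actual interfaces or prompts.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical types and names; the paper's real interfaces are not shown here.
Video = Sequence[object]  # any frame container the MLLM accepts


@dataclass
class ReReResult:
    initial_answer: str
    final_answer: str


def rere(
    ego_video: Video,
    question: str,
    ask_mllm: Callable[[Video, str], str],
    geometry_to_video: Callable[[Video], Video],
) -> ReReResult:
    # Reason Phase: form a hypothesis from the original egocentric video.
    initial = ask_mllm(ego_video, question)

    # Geometry-to-Video: render a novel-view video from predicted 3D geometry.
    novel_video = geometry_to_video(ego_video)

    # Re-reason Phase: the same model, in a second pass, verifies or revises.
    # The prompt wording below is an assumption made for illustration.
    revisit_prompt = (
        f"{question}\n"
        f"Your earlier answer from the original view was: {initial}\n"
        "Check it against this new view and confirm or correct it."
    )
    final = ask_mllm(novel_video, revisit_prompt)
    return ReReResult(initial_answer=initial, final_answer=final)
```

Passing the model and the renderer in as callables mirrors the training-free claim: nothing about the model changes between the two calls, only the video and the prompt.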

If this is right

  • Open-source MLLMs reach performance parity with proprietary state-of-the-art systems on the tested spatial benchmarks.
  • Spatial reasoning shifts from reliance on semantic priors toward verification against additional geometric evidence.
  • The same MLLM can be used for both phases without any fine-tuning or interface changes.
  • The approach works for any video input that admits a usable 3D reconstruction step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same revisiting logic could be applied to other video tasks where initial evidence is incomplete, such as action anticipation or object permanence checks.
  • Further gains may depend on improving the upstream 3D reconstruction quality rather than on larger language models.
  • The method suggests a general inference-time pattern: generate an auxiliary representation, then let the model re-examine its own earlier output against that representation.

Load-bearing premise

The 3D geometry estimated from the original video is accurate enough that the rendered novel views supply genuine new spatial evidence rather than noise or contradictions.
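
One way to see why this premise matters: a depth error that is invisible in the source view becomes a visible pixel shift once the point is re-projected into a different camera. The sketch below works this out for a simple pinhole camera. The intrinsics (`f`, `cx`, `cy`), the pure-translation novel camera, and the function names are illustrative assumptions, not values or code from the paper.

```python
import math


def backproject(u, v, depth, f, cx, cy):
    """Lift pixel (u, v) at the given depth into 3D camera coordinates."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)


def project(point, cam_center, f, cx, cy):
    """Project a 3D point into a camera at cam_center with the same orientation."""
    x, y, z = (p - c for p, c in zip(point, cam_center))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (f * x / z + cx, f * y / z + cy)


def novel_view_shift(u, v, true_depth, est_depth, cam_center,
                     f=500.0, cx=320.0, cy=240.0):
    """Pixel distance between the true and the mis-estimated point in the novel view."""
    p_true = project(backproject(u, v, true_depth, f, cx, cy), cam_center, f, cx, cy)
    p_est = project(backproject(u, v, est_depth, f, cx, cy), cam_center, f, cx, cy)
    return math.dist(p_true, p_est)


# Same pixel, 10% depth error: no shift in the source view, a clear shift
# once the camera moves 2 units sideways (hypothetical numbers).
shift_source = novel_view_shift(320, 240, 4.0, 4.4, (0.0, 0.0, 0.0))
shift_novel = novel_view_shift(320, 240, 4.0, 4.4, (2.0, 0.0, 0.0))
```

Under these assumptions the source-view shift is zero while the novel-view shift is tens of pixels, which is the sense in which rendered views can carry either new evidence or new artifacts.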

What would settle it

Running the Re-reason Phase on the generated novel-view videos produces no accuracy gain or a net loss on VSI-Bench or STI-Bench relative to the single Reason Phase alone.
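
The check described here reduces to comparing per-question outcomes of the two passes. A minimal sketch follows, assuming each answer has already been scored as simply correct or incorrect; the benchmarks may use graded metrics for numerical questions, and the function and key names are hypothetical.

```python
def revisit_effect(first_pass: list[bool], second_pass: list[bool]) -> dict[str, float]:
    """Compare per-question correctness before and after the Re-reason Phase."""
    if len(first_pass) != len(second_pass) or not first_pass:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(first_pass)
    # Questions the second pass fixed, and questions it newly got wrong.
    corrected = sum(1 for a, b in zip(first_pass, second_pass) if not a and b)
    broken = sum(1 for a, b in zip(first_pass, second_pass) if a and not b)
    return {
        "reason_acc": sum(first_pass) / n,
        "rereason_acc": sum(second_pass) / n,
        "net_gain": (corrected - broken) / n,
        "corrected": corrected,
        "broken": broken,
    }
```

Reporting `corrected` and `broken` separately is a deliberate choice: a net gain of zero could hide a second pass that fixes as many answers as it damages.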

Figures

Figures reproduced from arXiv: 2606.11683 by Chaofan Ma, Fanqin Zeng, Jiangchao Yao, Xiaofeng Cao, Yingjie Zhou, Yue Shi, Yuhuan Yang, Zhenjie Mao.

Figure 1: ReRe enables the model to revisit its initial hypothesis under a synthesized novel view, correcting spatial reasoning errors that single-turn inference misses. Each case shows the original egocentric video (top frames) and the synthesized novel view (bottom frames), along with the model’s reasoning before (Reason Phase, blue) and after (Re-reason Phase, red) revisiting. (a) Object Counting: The synthesized…
Figure 2: Overview of the ReRe Framework. Given an egocentric video, our method operates in two phases: (1) Reason Phase, where the MLLM forms an initial hypothesis from the original view; and (2) Re-reason Phase, where the model verifies its hypothesis against a synthesized allocentric view (V_exo). The Geometry-to-Video pipeline generates V_exo via trajectory planning and view rendering to provide complementary geo…
Figure 3: Overview of the Geometry-to-Video Pipeline. It consists of two stages: (1) Trajectory Planning, where we predict a 3D point cloud via VGGT and design a scene-spanning Oblique Sweep path; and (2) View Rendering, where we synthesize temporally coherent video frames V_exo via point-based rasterization.
Figure 4: Visual Comparison of Allocentric Trajectory Designs. (a) Oblique Sweep (Ours) follows a diagonal path through the scene center with an elevated tilt. (b) Mid-level Traverse moves horizontally along the diameter at a fixed elevation. (c) Bird’s-eye Orbit circles the scene center from a top-down perspective.
Figure 5: Qualitative Results on VSI-Bench. We visualize how ReRe resolves spatial ambiguities in (a)-(b) Object Counting, (c) Absolute Distance, and (d) Relative Direction.

Original abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
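
To make the "elevated, oblique perspective with scene-spanning coverage" concrete, here is a rough sketch of how such a camera path could be planned from a predicted point cloud. It is an assumption-laden illustration rather than the paper's trajectory planner: the z-up convention, the bounding-box diagonal, and the `height_ratio` and `pitch_deg` parameters are all hypothetical choices.

```python
import math

Point = tuple[float, float, float]


def oblique_sweep(points: list[Point], n_frames: int = 8,
                  height_ratio: float = 0.5, pitch_deg: float = 35.0):
    """Return one (camera_position, downward_pitch_deg) pair per frame.

    Assumes z is up. The camera moves along the diagonal of the point
    cloud's floor-plane bounding box, raised above the highest point.
    """
    if not points or n_frames < 2:
        raise ValueError("need at least one point and two frames")
    xs, ys, zs = zip(*points)
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)
    # Raise the camera in proportion to the horizontal extent of the scene.
    extent = math.hypot(x1 - x0, y1 - y0)
    cam_z = max(zs) + height_ratio * extent
    poses = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0 at one corner, 1 at the opposite corner
        position = (x0 + t * (x1 - x0), y0 + t * (y1 - y0), cam_z)
        poses.append((position, pitch_deg))
    return poses
```

With an odd number of frames, the middle pose sits directly above the center of the bounding box, so the path passes over the scene center as the diagonal sweep in the paper's figures is described as doing.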

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes Reason, then Re-reason (ReRe), a training-free inference-time framework for spatial reasoning from egocentric videos in MLLMs. It consists of a Reason Phase forming an initial spatial hypothesis from the input video, followed by a Re-reason Phase that verifies or revises the hypothesis using a synthesized novel-view video. The novel views are generated by a Geometry-to-Video pipeline that renders elevated, oblique perspectives from predicted 3D geometry. The central claim is that this cross-view revisiting substantially improves performance on VSI-Bench and STI-Bench, enabling open-source MLLMs to rival proprietary SOTA without model modifications.

Significance. If the experimental results hold and the pipeline is shown to be robust, the work would be significant as a practical, training-free method for addressing geometric ambiguity in video-based spatial reasoning. The approach preserves the MLLM's native video interface and avoids architectural changes, which are clear strengths. The emphasis on revisitable reasoning grounded in complementary viewpoints offers a falsifiable direction for improving inference-time performance on standard benchmarks.

major comments (3)
  1. [Abstract] The claim that ReRe 'substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance' on VSI-Bench and STI-Bench is presented without any quantitative results, specific baselines, ablation studies, or error analysis. This absence is load-bearing because the central claim cannot be evaluated for magnitude, statistical reliability, or comparison to existing methods.
  2. [§3, Geometry-to-Video pipeline and Re-reason Phase] The effectiveness of the Re-reason Phase depends on the rendered novel views supplying independent, verifiable spatial evidence rather than artifacts from upstream 3D prediction errors. No analysis, threshold experiments, or robustness tests are provided to show that the pipeline remains useful when monocular depth or structure prediction deviates from ground truth by amounts typical for open-source estimators. This directly undermines the claim that cross-view revisiting improves reasoning.
  3. [Evaluation sections] The manuscript references 'extensive evaluations' but supplies no details on the 3D prediction model used, how novel views are selected for complementarity, statistical significance of reported gains, or controls isolating the contribution of the Re-reason Phase versus the original video alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript would benefit from additional quantitative details in the abstract and expanded experimental analyses to better support our claims. We address each major comment below and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract] The claim that ReRe 'substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance' on VSI-Bench and STI-Bench is presented without any quantitative results, specific baselines, ablation studies, or error analysis. This absence is load-bearing because the central claim cannot be evaluated for magnitude, statistical reliability, or comparison to existing methods.

    Authors: We agree the abstract should provide concrete quantitative support. In the revised version, we will insert specific metrics (e.g., absolute and relative gains on VSI-Bench and STI-Bench versus open-source and proprietary baselines) and a concise reference to key ablations. This directly addresses the evaluability concern without altering the core claim. revision: yes

  2. Referee: [§3, Geometry-to-Video pipeline and Re-reason Phase] The effectiveness of the Re-reason Phase depends on the rendered novel views supplying independent, verifiable spatial evidence rather than artifacts from upstream 3D prediction errors. No analysis, threshold experiments, or robustness tests are provided to show that the pipeline remains useful when monocular depth or structure prediction deviates from ground truth by amounts typical for open-source estimators. This directly undermines the claim that cross-view revisiting improves reasoning.

    Authors: We acknowledge the importance of demonstrating robustness. We will add a dedicated analysis subsection in §3 containing threshold experiments that inject controlled errors into depth and structure predictions at levels typical of open-source monocular estimators, plus robustness tests across multiple 3D predictors. These will quantify when and how the Re-reason phase remains beneficial. revision: yes

  3. Referee: [Evaluation sections] The manuscript references 'extensive evaluations' but supplies no details on the 3D prediction model used, how novel views are selected for complementarity, statistical significance of reported gains, or controls isolating the contribution of the Re-reason Phase versus the original video alone.

    Authors: We will expand the evaluation sections to specify the exact 3D model and hyperparameters, detail the complementarity criteria and selection procedure for novel views, report statistical significance (e.g., paired t-tests or bootstrap p-values) for all gains, and include explicit controls that isolate the Re-reason Phase contribution versus the original video. These additions will make the experimental protocol fully reproducible and transparent. revision: yes
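
The rebuttal mentions bootstrap p-values for the reported gains. As a hedged illustration of what such a check could look like, the sketch below runs a paired bootstrap over per-question scores from the two passes. The function name, the one-sided summary statistic, and the resample count are assumptions; the authors' actual test procedure is not described in this summary.

```python
import random


def paired_bootstrap(base: list[float], revised: list[float],
                     n_resamples: int = 10000, seed: int = 0) -> tuple[float, float]:
    """Return (observed mean gain, fraction of resamples with no positive gain).

    Scores are paired per question: base[i] and revised[i] refer to the
    same question before and after the Re-reason Phase.
    """
    if len(base) != len(revised) or not base:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(base)
    diffs = [r - b for b, r in zip(base, revised)]
    observed = sum(diffs) / n
    not_positive = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair intact.
        sample_mean = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        if sample_mean <= 0:
            not_positive += 1
    return observed, not_positive / n_resamples
```

Resampling the per-question differences, rather than the two score lists independently, keeps the pairing intact, which matters because both passes answer the same questions.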

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a training-free, inference-time framework (Reason Phase on original video followed by Re-reason Phase on novel views rendered via Geometry-to-Video from predicted 3D geometry) whose performance claims rest on evaluations against external public benchmarks VSI-Bench and STI-Bench. No equations, parameter fits, self-citations, or ansatzes are invoked in a load-bearing way that reduces any claimed result to the inputs by construction; the pipeline components are treated as independent external modules rather than self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced. The method depends on domain assumptions about MLLM video processing and 3D reconstruction quality.

axioms (2)
  • Domain assumption: Multimodal LLMs can natively process synthesized video inputs without architectural modifications.
    The Re-reason Phase and pipeline design assume the MLLM handles the novel-view video directly.
  • Domain assumption: Predicted 3D geometry is accurate enough to generate complementary views that resolve spatial ambiguities.
    The effectiveness of the Geometry-to-Video pipeline rests on this premise for providing verifiable evidence.

pith-pipeline@v0.9.1-grok · 5764 in / 1428 out tokens · 35458 ms · 2026-06-27T10:39:29.593781+00:00 · methodology

Reference graph

Works this paper leans on

69 extracted references · 9 linked inside Pith

  1. [1]

    Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

  2. [2]

    Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  3. [3]

    Im- proved visual-spatial reasoning via r1-zero-like train- ing.arXiv preprint arXiv:2504.00883, 2025

    Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Im- proved visual-spatial reasoning via r1-zero-like train- ing.arXiv preprint arXiv:2504.00883, 2025

  4. [4]

    See&trek: Training-free spatial prompting for multimodal large language model.arXiv preprint arXiv:2509.16087, 2025

    Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, and Hui Xiong. See&trek: Training-free spatial prompting for multimodal large language model.arXiv preprint arXiv:2509.16087, 2025

  5. [5]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Ri- lyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  6. [6]

    Out of sight, not out of context? egocen- tric spatial reasoning in vlms across disjoint frames

    Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Ku- maravel. Out of sight, not out of context? egocen- tric spatial reasoning in vlms across disjoint frames. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 16146–16161, 2025

  7. [7]

    Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

    Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

  8. [8]

    Video- 3d llm: Learning position-aware video representation for 3d scene understanding

    Duo Zheng, Shijia Huang, and Liwei Wang. Video- 3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025

  9. [9]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, An- drea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InPro- ceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 5294–5306, 2025

  10. [10]

    Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

  11. [11]

    Beyond flatlands: Unlocking spatial intelligence by decoupling 3d reasoning from numerical regression

    Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, and Ping Jian. Beyond flatlands: Unlocking spatial intelligence by decoupling 3d reasoning from numerical regression. arXiv preprint arXiv:2511.11239, 2025

  12. [12]

    Mllms need 3d-aware representation super- vision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

    Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation super- vision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

  13. [13]

    Spatial forcing: Implicit spatial repre- sentation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial repre- sentation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

  14. [14]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  15. [15]

    Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  16. [16]

    What’s” up” with vision-language models? investi- gating their struggle with spatial reasoning.arXiv preprint arXiv:2310.19785, 2023

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s” up” with vision-language models? investi- gating their struggle with spatial reasoning.arXiv preprint arXiv:2310.19785, 2023. 10 Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

  17. [17]

    3dsrbench: A comprehensive 3d spatial reasoning benchmark

    Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 6924– 6934, 2025

  18. [18]

    Open-vocabulary semantic segmenta- tion with frozen vision-language models

    Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, and Weidi Xie. Open-vocabulary semantic segmenta- tion with frozen vision-language models. InBritish Machine Vision Conference (BMVC), 2022

  19. [19]

    Safire: Saccade-fixation reiteration with mamba for referring image segmentation.Advances in Neural Information Processing Systems, 38:7122–7148, 2026

    Zhenjie Mao, Yang Yuhuan, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Safire: Saccade-fixation reiteration with mamba for referring image segmentation.Advances in Neural Information Processing Systems, 38:7122–7148, 2026

  20. [20]

    ReMamber: Referring image segmentation with mamba twister

    Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, and Yanfeng Wang. ReMamber: Referring image segmentation with mamba twister. In European Conference on Computer Vision (ECCV), 2024

  21. [21]

    AttrSeg: Open- vocabulary semantic segmentation via attribute decomposition-aggregation

    Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, and Yanfeng Wang. AttrSeg: Open- vocabulary semantic segmentation via attribute decomposition-aggregation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  22. [22]

    Multi- modal prototypes for open-world semantic segmenta- tion.International Journal of Computer Vision (IJCV), 2024

    Yuhuan Yang, Chaofan Ma, Chen Ju, Fei Zhang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Multi- modal prototypes for open-world semantic segmenta- tion.International Journal of Computer Vision (IJCV), 2024

  23. [23]

    Spa- tialreasoner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

    Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spa- tialreasoner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

  24. [24]

    Spare: Enhancing spa- tial reasoning in vision-language models with syn- thetic data.arXiv preprint arXiv:2504.20648, 2025

    Michael Ogezi and Freda Shi. Spare: Enhancing spa- tial reasoning in vision-language models with syn- thetic data.arXiv preprint arXiv:2504.20648, 2025

  25. [25]

    Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14455–14465, 2024

  26. [26]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

  27. [27]

    Constraint and union for partially-supervised temporal sentence grounding.arXiv preprint arXiv:2302.09850, 2023

    Chen Ju, Haicheng Wang, Jinxiang Liu, Chaofan Ma, Ya Zhang, Peisen Zhao, Jianlong Chang, and Qi Tian. Constraint and union for partially-supervised temporal sentence grounding.arXiv preprint arXiv:2302.09850, 2023

  28. [28]

    MoMa: Modulat- ing mamba for adapting image foundation models to video recognition

    Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. MoMa: Modulat- ing mamba for adapting image foundation models to video recognition. InProceedings of the International Conference on Machine Learning (ICML), 2025

  29. [29]

    Contrast-unity for partially-supervised temporal sen- tence grounding

    Haicheng Wang, Chen Ju, Weixiong Lin, Chaofan Ma, Shuai Xiao, Ya Zhang, and Yanfeng Wang. Contrast-unity for partially-supervised temporal sen- tence grounding. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

  30. [30]

    Spatial understanding from videos: Structured prompts meet simulation data.arXiv preprint arXiv:2506.03642, 2025

    Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, and Liqiang Nie. Spatial understanding from videos: Structured prompts meet simulation data.arXiv preprint arXiv:2506.03642, 2025

  31. [31]

    Spatialprompting: Keyframe- driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

    Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, and Hiroyuki Sakai. Spatialprompting: Keyframe- driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

  32. [32]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

  33. [33]

    Geometric granularity aware pixel-to-mesh

    Yue Shi, Bingbing Ni, Jinxian Liu, Dingyi Rong, Ye Qian, and Wenjun Zhang. Geometric granularity aware pixel-to-mesh. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13097–13106, October 2021

  34. [34]

    Darf: Depth-aware generalizable neural radiance field.Displays, 88: 102996, 2025

    Yue Shi, Dingyi Rong, Chang Chen, Chaofan Ma, Bingbing Ni, and Wenjun Zhang. Darf: Depth-aware generalizable neural radiance field.Displays, 88: 102996, 2025

  35. [35]

    Mipmap-gs: Let gaussians deform with scale-specific mipmap for anti-aliasing rendering

    Jiameng Li, Yue Shi, Jiezhang Cao, Bingbing Ni, Wen- jun Zhang, Kai Zhang, and Luc Van Gool. Mipmap-gs: Let gaussians deform with scale-specific mipmap for anti-aliasing rendering. In2025 International Confer- ence on 3D Vision (3DV), 2025

  36. [36]

    Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Gavin Chang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, et al. Mvu- eval: Towards multi-video understanding evaluation 11 Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning for multimodal llms.arXiv preprint arXiv:2511.07250, 2025

  37. [37]

    Sti-bench: Are mllms ready for precise spatial-temporal world under- standing?arXiv preprint arXiv:2503.23765, 2025

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenx- iao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world under- standing?arXiv preprint arXiv:2503.23765, 2025

  38. [38]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  39. [39]

    Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  40. [40]

    Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

  41. [41]

    Internvl3: Exploring ad- vanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring ad- vanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Bur- nell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  43. [43]

    Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  44. [44]

    Spatialladder: Progressive training for spatial reasoning in vision- language models.arXiv preprint arXiv:2510.08531, 2025

    Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, et al. Spatialladder: Progressive training for spatial reasoning in vision- language models.arXiv preprint arXiv:2510.08531, 2025

  45. [45]

    FastVGGT: Training-free acceleration of visual ge- ometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual ge- ometry transformer. InInternational Conference on Learning Representations (ICLR), 2026

  46. [46]

    LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merg- ing.arXiv preprint arXiv:2512.04939, 2025

    Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merg- ing.arXiv preprint arXiv:2512.04939, 2025

  47. [47]

    Deep extreme cut: From extreme points to object segmentation

    Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont- Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 616–625, 2018

  48. [48]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, Piotr Doll´ar, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Confer- ence on Computer Vision (ICCV), pages 4015–4026, 2023

  49. [49]

    Chaofan Ma, Qisen Xu, Xiangfeng Wang, Bo Jin, Xiaoyun Zhang, Yanfeng Wang, and Ya Zhang. Boundary-aware supervoxel-level iteratively refined interactive 3D image segmentation with multi-agent re- inforcement learning.IEEE Transactions on Medical Imaging, 40(10):2563–2574, 2021

  50. [50]

    Transforming the interactive segmen- tation for medical imaging

    Wentao Liu, Chaofan Ma, Yuhuan Yang, Weidi Xie, and Ya Zhang. Transforming the interactive segmen- tation for medical imaging. InMedical Image Com- puting and Computer Assisted Intervention (MICCAI), 2022

  51. [51]

    Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. InProceed- ings of the European Conference on Computer Vision (ECCV), 2024. 12 Reason, Then Re-reason: Cross-view Revisiting Im...

  52. [52]

    Annotation-free audio-visual segmentation

    Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, and Weidi Xie. Annotation-free audio-visual segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

  53. [53]

    Unsupervised domain adaptation via similarity-based prototypes for cross-modality segmentation

    Ziyu Ye, Chen Ju, Chaofan Ma, and Xiaoyun Zhang. Unsupervised domain adaptation via similarity-based prototypes for cross-modality segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), 2021

  54. [54]

    Audio-aware query-enhanced transformer for audio-visual segmentation. arXiv preprint arXiv:2307.13236, 2023

    Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, and Ya Zhang. Audio-aware query-enhanced transformer for audio-visual segmentation. arXiv preprint arXiv:2307.13236, 2023

  55. [55]

    Open-vocabulary panoptic segmentation with text-to-image diffusion models

    Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2955–2966, 2023

  56. [56]

    DiffusionSeg: Adapting diffusion towards unsupervised object discovery. arXiv preprint arXiv:2303.09813, 2023

    Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, and Yanfeng Wang. DiffusionSeg: Adapting diffusion towards unsupervised object discovery. arXiv preprint arXiv:2303.09813, 2023

  57. [57]

    GenMask: Adapting DiT for segmentation via direct mask generation

    Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, and Yanfeng Wang. GenMask: Adapting DiT for segmentation via direct mask generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  58. [58]

    Tracking the rareness of diseases: Improving long-tail medical detection with a calibrated diffusion model

    Tianjiao Zhang, Chaofan Ma, and Yanfeng Wang. Tracking the rareness of diseases: Improving long-tail medical detection with a calibrated diffusion model. Electronics, 13(23):4693, 2024

  59. [59]

    FreeSegDiff: Annotation-free saliency segmentation with diffusion models

    Chaofan Ma, Yuhuan Yang, Chen Ju, Yue Shi, Ya Zhang, and Yanfeng Wang. FreeSegDiff: Annotation-free saliency segmentation with diffusion models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

  60. [60]

    UniReason 1.0: A unified reasoning framework for world knowledge aligned image generation and editing

    Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, and Jiaqi Wang. UniReason 1.0: A unified reasoning framework for world knowledge aligned image generation and editing. arXiv preprint arXiv:2602.02437, 2026

  61. [61]

    Interleaving reasoning for better text-to-image generation

    Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, and Shaohui Lin. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945, 2025

  62. [62]

    Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025

    Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025

  63. [63]

    DeepGen 1.0: A lightweight unified multimodal model for advancing image generation and editing

    Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, and Jiaqi Wang. DeepGen 1.0: A lightweight unified multimodal model for advancing image generation and editing. a...

  64. [64]

    **Observe** the video carefully and describe the key visual elements

  65. [65]

    **Infer** a plausible answer even if visual information is incomplete

  66. [66]

    **Conclude** with a final answer. Engage in an internal dialogue using expressions such as ’let me think’, ’wait’, ’Hmm’, ’oh, I see’, ’let’s break it down’, etc, or other natural language thought expressions. It’s encouraged to include self-reflection or verification in the reasoning process. Provide your detailed reasoning between the <think> and </thin...

  67. [67]

    **Compare** old vs. new views

  68. [68]

    **Reflect** on whether your prior conclusion holds. If the question relates to temporal order, primarily maintain your answer from the first round

  69. [69]

    **Confirm** your final answer. Follow this format strictly: <think>put your Step-by-step reasoning process here</think> <answer>put your specific answer here</answer> Rules: - Do not output text outside tags. - Only one final answer. - Avoid vague terms like ’around’ or ’approximately’. {Answer Format Constraint} Let’s think step by step about the compari...
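
The output format required by the prompt above (`<think>…</think> <answer>…</answer>`, no text outside the tags, only one final answer, no vague terms) can be checked mechanically. The following is a minimal sketch of such a check, not the authors' implementation: the function name `parse_response`, the `VAGUE_TERMS` tuple, and the choice to apply the vague-term rule only to the answer text are assumptions made for illustration.

```python
import re

# Whole-response pattern: one <think> block, then one <answer> block,
# with only whitespace allowed before, between, and after the tags.
TAGGED = re.compile(
    r"\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)

# The two terms named in the prompt; assumed here to apply to the answer only.
VAGUE_TERMS = ("around", "approximately")


def parse_response(text: str) -> dict:
    """Return {'think': ..., 'answer': ...} or raise ValueError on a rule violation."""
    # Rule: only one final answer (and one reasoning block).
    for tag in ("<think>", "</think>", "<answer>", "</answer>"):
        if text.count(tag) != 1:
            raise ValueError(f"expected exactly one {tag} tag")

    # Rule: do not output text outside the tags.
    match = TAGGED.fullmatch(text)
    if match is None:
        raise ValueError("text found outside <think>/<answer> tags")

    answer = match.group("answer").strip()
    if not answer:
        raise ValueError("empty answer")

    # Rule: avoid vague terms such as 'around' or 'approximately'.
    for term in VAGUE_TERMS:
        if re.search(rf"\b{term}\b", answer, re.IGNORECASE):
            raise ValueError(f"vague term in answer: {term}")

    return {"think": match.group("think").strip(), "answer": answer}
```

A response that fails any check raises `ValueError`, so a caller could retry the query or fall back to the first-round answer; which of these is appropriate is left open here.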