Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Chaofan Ma; Fanqin Zeng; Jiangchao Yao; Xiaofeng Cao; Yingjie Zhou; Yue Shi; Yuhuan Yang; Zhenjie Mao

arxiv: 2606.11683 · v1 · pith:TT6VTEENnew · submitted 2026-06-10 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-27 10:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords spatial reasoning · multimodal large language models · novel view synthesis · egocentric video · inference-time reasoning · 3D reconstruction · video understanding

The pith

A two-phase process lets an MLLM form a spatial hypothesis from video, then revise it after seeing complementary views synthesized from predicted 3D geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that spatial reasoning in egocentric video is limited by single camera paths and single-turn inference, so models must lean on semantic priors instead of new evidence. It proposes that reasoning should stay revisitable: an initial hypothesis can be checked or corrected once additional viewpoints appear. ReRe implements this with a training-free Reason Phase followed by a Re-reason Phase that feeds the model a novel-view video. The novel views come from a Geometry-to-Video pipeline that turns predicted 3D structure into an elevated, scene-spanning video the model can read natively. On two spatial-reasoning benchmarks, VSI-Bench and STI-Bench, the method is reported to lift open-source multimodal models to levels that rival proprietary state-of-the-art systems.

Core claim

Conclusions drawn from limited video evidence remain open to revision when strategically complementary novel views, rendered from the same predicted 3D geometry, become available; feeding those views back to the same MLLM in a second inference pass produces measurable gains in spatial accuracy without any model retraining or architectural change.
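The two-pass control flow described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `mllm` callable, the `synthesize_novel_view` stand-in for the Geometry-to-Video pipeline, and the wording of the second prompt are all hypothetical, and whether the second pass also receives the original video is not stated in the abstract.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: an MLLM is any callable (video, prompt) -> answer text.
MLLM = Callable[[Sequence, str], str]


def rere(
    mllm: MLLM,
    ego_video: Sequence,
    question: str,
    synthesize_novel_view: Callable[[Sequence], Sequence],
) -> Tuple[str, str]:
    """Run a Reason pass, then a Re-reason pass on a synthesized view."""
    # Reason Phase: form an initial hypothesis from the original video.
    hypothesis = mllm(ego_video, question)

    # Stand-in for the Geometry-to-Video pipeline (assumed to return frames).
    novel_video = synthesize_novel_view(ego_video)

    # Re-reason Phase: same model, no retraining; the prompt text is an assumption.
    prompt = (
        f"{question}\n"
        f"Your earlier answer from the original view was: {hypothesis}\n"
        "This video shows the same scene from a different viewpoint. "
        "Verify or revise your answer."
    )
    final_answer = mllm(novel_video, prompt)
    return hypothesis, final_answer
```

Because the model is passed in as a plain callable, the same sketch would apply to any video-capable model wrapper without interface changes, which mirrors the model-agnostic framing of the claim.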

What carries the argument

The ReRe two-phase framework whose Re-reason Phase consumes a Geometry-to-Video output that renders an elevated oblique video spanning the full scene from the initial 3D reconstruction.
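The Figure 3 caption names point-based rasterization as the rendering step. The sketch below shows the general technique in its simplest form, a pinhole projection with a z-buffer, under stated assumptions: points are already in the camera frame with +z forward, one point covers one pixel, and `focal`, `width`, and `height` are illustrative parameters. The paper's actual renderer may differ.

```python
import math


def rasterize_points(points, colors, width, height, focal):
    """Project camera-space points onto an image, keeping the nearest per pixel.

    points: list of (x, y, z) tuples in the camera frame, +z pointing forward.
    colors: one value per point (any type).
    Returns (image, depth) as nested lists; empty pixels hold None / inf.
    """
    image = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # principal point at the image center

    for (x, y, z), color in zip(points, colors):
        if z <= 0:
            continue  # point is behind the camera
        u = int(math.floor(focal * x / z + cx))
        v = int(math.floor(focal * y / z + cy))
        # z-buffer test: only overwrite a pixel with a closer point.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = color
    return image, depth
```

Pixels that no point lands on stay empty in this sketch, which is one concrete way that sparse or inaccurate geometry would show up as holes in a rendered view.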

If this is right

Open-source MLLMs reach performance that rivals proprietary state-of-the-art systems on the tested spatial benchmarks.
Spatial reasoning shifts from reliance on semantic priors toward verification against additional geometric evidence.
The same MLLM can be used for both phases without any fine-tuning or interface changes.
The approach works for any video input that admits a usable 3D reconstruction step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same revisiting logic could be applied to other video tasks where initial evidence is incomplete, such as action anticipation or object permanence checks.
Further gains may depend on improving the upstream 3D reconstruction quality rather than on larger language models.
The method suggests a general inference-time pattern: generate an auxiliary representation, then let the model re-examine its own earlier output against that representation.

Load-bearing premise

The 3D geometry estimated from the original video is accurate enough that the rendered novel views supply genuine new spatial evidence rather than noise or contradictions.

What would settle it

The claim would fail if running the Re-reason Phase on the generated novel-view videos produced no accuracy gain, or a net loss, on VSI-Bench or STI-Bench relative to the Reason Phase alone.
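One way to make this check concrete is to tally, per question, whether the second pass fixed a wrong answer or broke a right one. The helper below is a hypothetical sketch that assumes each answer has already been scored as correct or incorrect; the benchmarks may also use graded numeric metrics that this binary view does not capture.

```python
def revision_outcomes(first_correct, second_correct):
    """Summarize how a second pass changed per-question correctness.

    first_correct / second_correct: equal-length sequences of booleans,
    one entry per question, for the Reason and Re-reason passes.
    """
    if len(first_correct) != len(second_correct) or not first_correct:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(first_correct)
    pairs = list(zip(first_correct, second_correct))
    fixed = sum(1 for a, b in pairs if not a and b)   # wrong -> right
    broken = sum(1 for a, b in pairs if a and not b)  # right -> wrong
    return {
        "acc_reason": sum(first_correct) / n,
        "acc_rereason": sum(second_correct) / n,
        "fixed": fixed,
        "broken": broken,
        "net_gain": (fixed - broken) / n,
    }
```

A `net_gain` at or below zero would correspond to the failure condition described above, and the `fixed` and `broken` counts show whether a flat headline number hides revisions in both directions.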

Figures

Figures reproduced from arXiv: 2606.11683 by Chaofan Ma, Fanqin Zeng, Jiangchao Yao, Xiaofeng Cao, Yingjie Zhou, Yue Shi, Yuhuan Yang, Zhenjie Mao.

**Figure 1.** ReRe enables the model to revisit its initial hypothesis under a synthesized novel view, correcting spatial reasoning errors that single-turn inference misses. Each case shows the original egocentric video (top frames) and the synthesized novel view (bottom frames), along with the model’s reasoning before (Reason Phase, blue) and after (Re-reason Phase, red) revisiting. (a) Object Counting: The synthesized…

**Figure 2.** Overview of the ReRe Framework. Given an egocentric video, our method operates in two phases: (1) Reason Phase, where the MLLM forms an initial hypothesis from the original view; and (2) Re-reason Phase, where the model verifies its hypothesis against a synthesized allocentric view (V_exo). The Geometry-to-Video pipeline generates V_exo via trajectory planning and view rendering to provide complementary geo…

**Figure 3.** Overview of the Geometry-to-Video Pipeline. It consists of two stages: (1) Trajectory Planning, where we predict a 3D point cloud via VGGT and design a scene-spanning Oblique Sweep path; and (2) View Rendering, where we synthesize temporally coherent video frames V_exo via point-based rasterization.

**Figure 4.** Visual Comparison of Allocentric Trajectory Designs. (a) Oblique Sweep (Ours) follows a diagonal path through the scene center with an elevated tilt. (b) Mid-level Traverse moves horizontally along the diameter at a fixed elevation. (c) Bird’s-eye Orbit circles the scene center from a top-down perspective.

**Figure 5.** Qualitative Results on VSI-Bench. We visualize how ReRe resolves spatial ambiguities in (a)-(b) Object Counting, (c) Absolute Distance, and (d) Relative Direction.
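The Figure 4 caption describes the Oblique Sweep as a diagonal path through the scene center with an elevated tilt. A rough geometric reading of that description is sketched below. Every specific here is an assumption made for illustration: the scene is summarized by an axis-aligned bounding box with z up, the camera rides the horizontal diagonal of that box at a fixed fraction of its height, and `height_frac` and `tilt_deg` are invented parameters, not values from the paper.

```python
import math


def oblique_sweep(bbox_min, bbox_max, n_frames=8, height_frac=0.8, tilt_deg=35.0):
    """Return (position, forward) camera poses along a scene-spanning diagonal.

    bbox_min / bbox_max: (x, y, z) corners of the scene bounding box, z up.
    The camera moves corner to corner above the horizontal center and looks
    along the travel direction, pitched downward by tilt_deg.
    """
    (x0, y0, z0), (x1, y1, z1) = bbox_min, bbox_max
    dx, dy = x1 - x0, y1 - y0
    span = math.hypot(dx, dy)
    if span == 0 or n_frames < 2:
        raise ValueError("need a non-degenerate box and at least two frames")

    cam_z = z0 + height_frac * (z1 - z0)  # elevated camera height
    tilt = math.radians(tilt_deg)
    # Unit forward vector: travel direction rotated downward by the tilt angle.
    forward = (
        math.cos(tilt) * dx / span,
        math.cos(tilt) * dy / span,
        -math.sin(tilt),
    )

    poses = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0 at the start corner, 1 at the far corner
        poses.append(((x0 + t * dx, y0 + t * dy, cam_z), forward))
    return poses
```

The other two trajectories in the caption could be sketched the same way by swapping the path (a level traverse or a circular orbit) while keeping the pose format unchanged.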

Original abstract

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReRe is a training-free two-phase method that lets MLLMs form a spatial guess, then revise it against elevated views synthesized from predicted 3D geometry; the gains, however, rest on unproven 3D accuracy.

This paper's core idea is straightforward: run an MLLM once on the original egocentric video to form a spatial hypothesis, then feed it a second video of novel views rendered from monocular 3D predictions so it can verify or correct the first guess. The Geometry-to-Video pipeline produces elevated oblique perspectives with wider coverage while leaving the model's video interface untouched.

The approach is new in its explicit cross-view revisiting setup and the focus on inference-time revision rather than single-turn prompting or retraining. It directly targets the camera-trajectory constraint that forces models to lean on semantic priors. Keeping everything training-free and model-agnostic is a practical strength if the numbers hold.

The main soft spot is the dependence on 3D geometry quality. The abstract describes the pipeline but supplies no evidence that the rendered views remain useful when depth or structure estimates contain typical errors from open-source predictors. Systematic mistakes in the 3D stage would propagate directly into misleading novel views, and nothing shown indicates they tested robustness to that failure mode. The claimed large gains on VSI-Bench and STI-Bench are stated without numbers, baselines, or ablations in the abstract, so the actual effect size and consistency are still unknown.

The work is aimed at researchers applying MLLMs to spatial tasks in robotics, navigation, or AR where camera paths are limited. A reader looking for simple inference-time tricks might extract value once the experiments are visible. The thinking is clear and the problem framing is honest.

I would send it for peer review. Referees can check whether the 3D accuracy issue is addressed in the full experiments and whether the benchmark improvements are reproducible.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes Reason, then Re-reason (ReRe), a training-free inference-time framework for spatial reasoning from egocentric videos in MLLMs. It consists of a Reason Phase forming an initial spatial hypothesis from the input video, followed by a Re-reason Phase that verifies or revises the hypothesis using a synthesized novel-view video. The novel views are generated by a Geometry-to-Video pipeline that renders elevated, oblique perspectives from predicted 3D geometry. The central claim is that this cross-view revisiting substantially improves performance on VSI-Bench and STI-Bench, enabling open-source MLLMs to rival proprietary SOTA without model modifications.

Significance. If the experimental results hold and the pipeline is shown to be robust, the work would be significant as a practical, training-free method for addressing geometric ambiguity in video-based spatial reasoning. The approach preserves the MLLM's native video interface and avoids architectural changes, which are clear strengths. The emphasis on revisitable reasoning grounded in complementary viewpoints offers a falsifiable direction for improving inference-time performance on standard benchmarks.

major comments (3)

[Abstract] The claim that ReRe 'substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance' on VSI-Bench and STI-Bench is presented without any quantitative results, specific baselines, ablation studies, or error analysis. This absence is load-bearing because the central claim cannot be evaluated for magnitude, statistical reliability, or comparison to existing methods.
[§3] (Geometry-to-Video pipeline and Re-reason Phase) The effectiveness of the Re-reason Phase depends on the rendered novel views supplying independent, verifiable spatial evidence rather than artifacts from upstream 3D prediction errors. No analysis, threshold experiments, or robustness tests are provided to show that the pipeline remains useful when monocular depth or structure prediction deviates from ground truth by amounts typical for open-source estimators. This directly undermines the claim that cross-view revisiting improves reasoning.
[Evaluation sections] The manuscript references 'extensive evaluations' but supplies no details on the 3D prediction model used, how novel views are selected for complementarity, statistical significance of reported gains, or controls isolating the contribution of the Re-reason Phase versus the original video alone.
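The threshold experiment requested in the §3 comment could take roughly the following shape: perturb the predicted geometry at controlled noise levels and re-run the downstream evaluation at each level. The sketch is hypothetical. The noise model (a Gaussian relative error applied along each viewing ray), the `evaluate` callback, and the default noise levels are assumptions chosen for illustration, not a protocol from the paper.

```python
import random


def perturb_depth(points, rel_sigma, seed=0):
    """Scale each camera-space point along its viewing ray by 1 + N(0, rel_sigma).

    Scaling x, y, and z together keeps each point on the same ray from the
    camera origin, so only its depth changes in the source view.
    """
    rng = random.Random(seed)
    noisy = []
    for x, y, z in points:
        scale = max(1.0 + rng.gauss(0.0, rel_sigma), 1e-6)  # keep depth positive
        noisy.append((x * scale, y * scale, z * scale))
    return noisy


def noise_sweep(points, evaluate, sigmas=(0.0, 0.05, 0.1, 0.2)):
    """Map each noise level to the score returned by a caller-supplied evaluate()."""
    return {sigma: evaluate(perturb_depth(points, sigma)) for sigma in sigmas}
```

In a real study `evaluate` would render novel views from the perturbed points and score the Re-reason Phase on them; plotting the resulting scores against `sigma` would show where, if anywhere, the second pass stops helping.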

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript would benefit from additional quantitative details in the abstract and expanded experimental analyses to better support our claims. We address each major comment below and will incorporate the suggested revisions.

Referee: [Abstract] The claim that ReRe 'substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance' on VSI-Bench and STI-Bench is presented without any quantitative results, specific baselines, ablation studies, or error analysis. This absence is load-bearing because the central claim cannot be evaluated for magnitude, statistical reliability, or comparison to existing methods.

Authors: We agree the abstract should provide concrete quantitative support. In the revised version, we will insert specific metrics (e.g., absolute and relative gains on VSI-Bench and STI-Bench versus open-source and proprietary baselines) and a concise reference to key ablations. This directly addresses the evaluability concern without altering the core claim.
revision: yes

Referee: [§3] (Geometry-to-Video pipeline and Re-reason Phase) The effectiveness of the Re-reason Phase depends on the rendered novel views supplying independent, verifiable spatial evidence rather than artifacts from upstream 3D prediction errors. No analysis, threshold experiments, or robustness tests are provided to show that the pipeline remains useful when monocular depth or structure prediction deviates from ground truth by amounts typical for open-source estimators. This directly undermines the claim that cross-view revisiting improves reasoning.

Authors: We acknowledge the importance of demonstrating robustness. We will add a dedicated analysis subsection in §3 containing threshold experiments that inject controlled errors into depth and structure predictions at levels typical of open-source monocular estimators, plus robustness tests across multiple 3D predictors. These will quantify when and how the Re-reason Phase remains beneficial.
revision: yes

Referee: [Evaluation sections] The manuscript references 'extensive evaluations' but supplies no details on the 3D prediction model used, how novel views are selected for complementarity, statistical significance of reported gains, or controls isolating the contribution of the Re-reason Phase versus the original video alone.

Authors: We will expand the evaluation sections to specify the exact 3D model and hyperparameters, detail the complementarity criteria and selection procedure for novel views, report statistical significance (e.g., paired t-tests or bootstrap p-values) for all gains, and include explicit controls that isolate the Re-reason Phase contribution versus the original video. These additions will make the experimental protocol fully reproducible and transparent.
revision: yes
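The bootstrap option mentioned in the last response can be illustrated with a paired resampling of per-question score differences. This is a generic sketch of the technique, not the authors' protocol: the function name, the resample count, and the percentile interval are illustrative choices, and the final value is the share of resamples with a non-positive mean gain, a rough one-sided indicator rather than a formal p-value.

```python
import random


def paired_bootstrap(base_scores, rere_scores, n_resamples=2000, seed=0):
    """Bootstrap the mean per-question gain of a two-pass run over a baseline.

    base_scores / rere_scores: equal-length per-question scores for the
    Reason-only and Reason + Re-reason runs on the same questions.
    Returns (observed_gain, (ci_low, ci_high), share_non_positive).
    """
    if len(base_scores) != len(rere_scores) or not base_scores:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(base_scores)
    diffs = [r - b for b, r in zip(base_scores, rere_scores)]
    observed = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[min(int(0.975 * n_resamples), n_resamples - 1)]
    share_non_positive = sum(1 for m in means if m <= 0) / n_resamples
    return observed, (ci_low, ci_high), share_non_positive
```

Pairing matters here because both runs answer the same questions; resampling the differences rather than the two score lists independently keeps question difficulty from inflating the variance.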

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a training-free, inference-time framework (Reason Phase on original video followed by Re-reason Phase on novel views rendered via Geometry-to-Video from predicted 3D geometry) whose performance claims rest on evaluations against external public benchmarks VSI-Bench and STI-Bench. No equations, parameter fits, self-citations, or ansatzes are invoked in a load-bearing way that reduces any claimed result to the inputs by construction; the pipeline components are treated as independent external modules rather than self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities are introduced. The method depends on domain assumptions about MLLM video processing and 3D reconstruction quality.

axioms (2)

domain assumption: Multimodal LLMs can natively process synthesized video inputs without architectural modifications.
The Re-reason Phase and pipeline design assume the MLLM handles the novel-view video directly.
domain assumption: Predicted 3D geometry is accurate enough to generate complementary views that resolve spatial ambiguities.
The effectiveness of the Geometry-to-Video pipeline rests on this premise for providing verifiable evidence.

Reference graph

Works this paper leans on

69 extracted references · 9 linked inside Pith

[1]

Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

Pith/arXiv arXiv 2025
[2]

Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

Pith/arXiv arXiv 2025
[3]

Im- proved visual-spatial reasoning via r1-zero-like train- ing.arXiv preprint arXiv:2504.00883, 2025

Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Im- proved visual-spatial reasoning via r1-zero-like train- ing.arXiv preprint arXiv:2504.00883, 2025

arXiv 2025
[4]

See&trek: Training-free spatial prompting for multimodal large language model.arXiv preprint arXiv:2509.16087, 2025

Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, and Hui Xiong. See&trek: Training-free spatial prompting for multimodal large language model.arXiv preprint arXiv:2509.16087, 2025

arXiv 2025
[5]

Thinking in space: How multimodal large language models see, remember, and recall spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Ri- lyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

2025
[6]

Out of sight, not out of context? egocen- tric spatial reasoning in vlms across disjoint frames

Sahithya Ravi, Gabriel Herbert Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Ku- maravel. Out of sight, not out of context? egocen- tric spatial reasoning in vlms across disjoint frames. InProceedings of the 2025 Conference on Empiri- cal Methods in Natural Language Processing, pages 16146–16161, 2025

2025
[7]

Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from videos for 3d world: Enhancing mllms with 3d vision geometry priors.arXiv preprint arXiv:2505.24625, 2025

arXiv 2025
[8]

Video- 3d llm: Learning position-aware video representation for 3d scene understanding

Duo Zheng, Shijia Huang, and Liwei Wang. Video- 3d llm: Learning position-aware video representation for 3d scene understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025

2025
[9]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, An- drea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InPro- ceedings of the Computer Vision and Pattern Recogni- tion Conference, pages 5294–5306, 2025

2025
[10]

Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

Pith/arXiv arXiv 2025
[11]

Beyond flatlands: Unlocking spatial intelligence by decoupling 3d reasoning from numerical regression

Zhongbin Guo, Jiahe Liu, Yushan Li, Wenyu Gao, Zhen Yang, Chenzhi Li, Xinyue Zhang, and Ping Jian. Beyond flatlands: Unlocking spatial intelligence by decoupling 3d reasoning from numerical regression. arXiv preprint arXiv:2511.11239, 2025

arXiv 2025
[12]

Mllms need 3d-aware representation super- vision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

Xiaohu Huang, Jingjing Wu, Qunyi Xie, and Kai Han. Mllms need 3d-aware representation super- vision for scene understanding.arXiv preprint arXiv:2506.01946, 2025

arXiv 2025
[13]

Spatial forcing: Implicit spatial repre- sentation alignment for vision-language-action model

Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial repre- sentation alignment for vision-language-action model. arXiv preprint arXiv:2510.12276, 2025

arXiv 2025
[14]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

2014
[15]

Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning.Transactions of the Association for Computational Linguistics, 11:635–651, 2023

2023
[16]

What’s” up” with vision-language models? investi- gating their struggle with spatial reasoning.arXiv preprint arXiv:2310.19785, 2023

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s” up” with vision-language models? investi- gating their struggle with spatial reasoning.arXiv preprint arXiv:2310.19785, 2023. 10 Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

arXiv 2023
[17]

3dsrbench: A comprehensive 3d spatial reasoning benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 6924– 6934, 2025

2025
[18]

Open-vocabulary semantic segmenta- tion with frozen vision-language models

Chaofan Ma, Yuhuan Yang, Yanfeng Wang, Ya Zhang, and Weidi Xie. Open-vocabulary semantic segmenta- tion with frozen vision-language models. InBritish Machine Vision Conference (BMVC), 2022

2022
[19]

Safire: Saccade-fixation reiteration with mamba for referring image segmentation.Advances in Neural Information Processing Systems, 38:7122–7148, 2026

Zhenjie Mao, Yang Yuhuan, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Safire: Saccade-fixation reiteration with mamba for referring image segmentation.Advances in Neural Information Processing Systems, 38:7122–7148, 2026

2026
[20]

ReMamber: Referring image segmentation with mamba twister

Yuhuan Yang, Chaofan Ma, Jiangchao Yao, Zhun Zhong, Ya Zhang, and Yanfeng Wang. ReMamber: Referring image segmentation with mamba twister. In European Conference on Computer Vision (ECCV), 2024

2024
[21]

AttrSeg: Open- vocabulary semantic segmentation via attribute decomposition-aggregation

Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Ya Zhang, and Yanfeng Wang. AttrSeg: Open- vocabulary semantic segmentation via attribute decomposition-aggregation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[22]

Multi- modal prototypes for open-world semantic segmenta- tion.International Journal of Computer Vision (IJCV), 2024

Yuhuan Yang, Chaofan Ma, Chen Ju, Fei Zhang, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Multi- modal prototypes for open-world semantic segmenta- tion.International Journal of Computer Vision (IJCV), 2024

2024
[23]

Spa- tialreasoner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, and Alan Yuille. Spa- tialreasoner: Towards explicit and generalizable 3d spatial reasoning.arXiv preprint arXiv:2504.20024, 2025

arXiv 2025
[24]

Spare: Enhancing spa- tial reasoning in vision-language models with syn- thetic data.arXiv preprint arXiv:2504.20648, 2025

Michael Ogezi and Freda Shi. Spare: Enhancing spa- tial reasoning in vision-language models with syn- thetic data.arXiv preprint arXiv:2504.20648, 2025

arXiv 2025
[25]

Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spa- tialvlm: Endowing vision-language models with spa- tial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 14455–14465, 2024

2024
[26]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

2021
[27]

Constraint and union for partially-supervised temporal sentence grounding.arXiv preprint arXiv:2302.09850, 2023

Chen Ju, Haicheng Wang, Jinxiang Liu, Chaofan Ma, Ya Zhang, Peisen Zhao, Jianlong Chang, and Qi Tian. Constraint and union for partially-supervised temporal sentence grounding.arXiv preprint arXiv:2302.09850, 2023

arXiv 2023
[28]

MoMa: Modulat- ing mamba for adapting image foundation models to video recognition

Yuhuan Yang, Chaofan Ma, Zhenjie Mao, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. MoMa: Modulat- ing mamba for adapting image foundation models to video recognition. InProceedings of the International Conference on Machine Learning (ICML), 2025

2025
[29]

Contrast-unity for partially-supervised temporal sen- tence grounding

Haicheng Wang, Chen Ju, Weixiong Lin, Chaofan Ma, Shuai Xiao, Ya Zhang, and Yanfeng Wang. Contrast-unity for partially-supervised temporal sen- tence grounding. InIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

2025
[30]

Spatial understanding from videos: Structured prompts meet simulation data.arXiv preprint arXiv:2506.03642, 2025

Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, and Liqiang Nie. Spatial understanding from videos: Structured prompts meet simulation data.arXiv preprint arXiv:2506.03642, 2025

arXiv 2025
[31]

Spatialprompting: Keyframe- driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, and Hiroyuki Sakai. Spatialprompting: Keyframe- driven zero-shot spatial reasoning with off-the-shelf multimodal large language models.arXiv preprint arXiv:2505.04911, 2025

arXiv 2025
[32]

3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

2023
[33]

Geometric granularity aware pixel-to-mesh

Yue Shi, Bingbing Ni, Jinxian Liu, Dingyi Rong, Ye Qian, and Wenjun Zhang. Geometric granularity aware pixel-to-mesh. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 13097–13106, October 2021

2021
[34]

Darf: Depth-aware generalizable neural radiance field.Displays, 88: 102996, 2025

Yue Shi, Dingyi Rong, Chang Chen, Chaofan Ma, Bingbing Ni, and Wenjun Zhang. Darf: Depth-aware generalizable neural radiance field.Displays, 88: 102996, 2025

2025
[35]

Mipmap-gs: Let gaussians deform with scale-specific mipmap for anti-aliasing rendering

Jiameng Li, Yue Shi, Jiezhang Cao, Bingbing Ni, Wen- jun Zhang, Kai Zhang, and Luc Van Gool. Mipmap-gs: Let gaussians deform with scale-specific mipmap for anti-aliasing rendering. In2025 International Confer- ence on 3D Vision (3DV), 2025

2025
[36]

Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Gavin Chang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, et al. Mvu- eval: Towards multi-video understanding evaluation 11 Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning for multimodal llms.arXiv preprint arXiv:2511.07250, 2025

arXiv 2025
[37]

Sti-bench: Are mllms ready for precise spatial-temporal world under- standing?arXiv preprint arXiv:2503.23765, 2025

Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenx- iao Cai, Zheng Liu, and Bo Zhao. Sti-bench: Are mllms ready for precise spatial-temporal world under- standing?arXiv preprint arXiv:2503.23765, 2025

arXiv 2025
[38]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

Pith/arXiv arXiv 2025
[39]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

Pith/arXiv arXiv 2025
[40]

Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding perfor- mance boundaries of open-source multimodal models with model, data, and test-time scaling.arXiv preprint arXiv:2412.05271, 2024

Pith/arXiv arXiv 2024
[41]

Internvl3: Exploring ad- vanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring ad- vanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

Pith/arXiv arXiv 2025
[42]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Bur- nell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

Pith/arXiv arXiv 2024
[43]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[44]

Spatialladder: Progressive training for spatial reasoning in vision- language models.arXiv preprint arXiv:2510.08531, 2025

Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, et al. Spatialladder: Progressive training for spatial reasoning in vision- language models.arXiv preprint arXiv:2510.08531, 2025

arXiv 2025
[45]

FastVGGT: Training-free acceleration of visual ge- ometry transformer

You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. FastVGGT: Training-free acceleration of visual ge- ometry transformer. InInternational Conference on Learning Representations (ICLR), 2026

2026
[46] Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging. arXiv preprint arXiv:2512.04939, 2025
[47] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 616–625, 2018
[48] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023
[49] Chaofan Ma, Qisen Xu, Xiangfeng Wang, Bo Jin, Xiaoyun Zhang, Yanfeng Wang, and Ya Zhang. Boundary-aware supervoxel-level iteratively refined interactive 3D image segmentation with multi-agent reinforcement learning. IEEE Transactions on Medical Imaging, 40(10):2563–2574, 2021
[50] Wentao Liu, Chaofan Ma, Yuhuan Yang, Weidi Xie, and Ya Zhang. Transforming the interactive segmentation for medical imaging. In Medical Image Computing and Computer Assisted Intervention (MICCAI), 2022
[51] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 2024
[52] Jinxiang Liu, Yu Wang, Chen Ju, Chaofan Ma, Ya Zhang, and Weidi Xie. Annotation-free audio-visual segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024
[53] Ziyu Ye, Chen Ju, Chaofan Ma, and Xiaoyun Zhang. Unsupervised domain adaptation via similarity-based prototypes for cross-modality segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), 2021
[54] Jinxiang Liu, Chen Ju, Chaofan Ma, Yanfeng Wang, Yu Wang, and Ya Zhang. Audio-aware query-enhanced transformer for audio-visual segmentation. arXiv preprint arXiv:2307.13236, 2023
[55] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2955–2966, 2023
[56] Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, and Yanfeng Wang. DiffusionSeg: Adapting diffusion towards unsupervised object discovery. arXiv preprint arXiv:2303.09813, 2023
[57] Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, and Yanfeng Wang. GenMask: Adapting DiT for segmentation via direct mask generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
[58] Tianjiao Zhang, Chaofan Ma, and Yanfeng Wang. Tracking the rareness of diseases: Improving long-tail medical detection with a calibrated diffusion model. Electronics, 13(23):4693, 2024
[59] Chaofan Ma, Yuhuan Yang, Chen Ju, Yue Shi, Ya Zhang, and Yanfeng Wang. FreeSegDiff: Annotation-free saliency segmentation with diffusion models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025
[60] Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, and Jiaqi Wang. UniReason 1.0: A unified reasoning framework for world knowledge aligned image generation and editing. arXiv preprint arXiv:2602.02437, 2026
[61] Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, and Shaohui Lin. Interleaving reasoning for better text-to-image generation. arXiv preprint arXiv:2509.06945, 2025
[62] Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025
[63] Dianyi Wang, Ruihang Li, Feng Han, Chaofan Ma, Wei Song, Siyuan Wang, Yibin Wang, Yi Xin, Hongjian Liu, Zhixiong Zhang, Shengyuan Ding, Tianhang Wang, Zhenglin Cheng, Tao Lin, Cheng Jin, Kaicheng Yu, Jingjing Chen, Wenjie Wang, Zhongyu Wei, and Jiaqi Wang. DeepGen 1.0: A lightweight unified multimodal model for advancing image generation and editing.a...

arXiv 2026
**Observe** the video carefully and describe the key visual elements

**Infer** a plausible answer even if visual information is incomplete

**Conclude** with a final answer. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc, or other natural language thought expressions. It's encouraged to include self-reflection or verification in the reasoning process. Provide your detailed reasoning between the <think> and </thin...

**Compare** old vs. new views

**Reflect** on whether your prior conclusion holds. If the question relates to temporal order, primarily maintain your answer from the first round

**Confirm** your final answer. Follow this format strictly: <think>put your Step-by-step reasoning process here</think> <answer>put your specific answer here</answer> Rules: - Do not output text outside tags. - Only one final answer. - Avoid vague terms like 'around' or 'approximately'. {Answer Format Constraint} Let's think step by step about the compari...
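
The output format requested by the prompt above (one `<think>` block, then one `<answer>` block, with no text outside the tags) can be checked mechanically before an answer is scored. The following is a minimal sketch of such a check, not the authors' evaluation code; the names `TaggedReply` and `parse_tagged_reply` are hypothetical, and it assumes the reply arrives as a plain string.

```python
import re
from typing import NamedTuple, Optional


class TaggedReply(NamedTuple):
    """Reasoning text and final answer extracted from a model reply."""
    think: str
    answer: str


# One <think> block followed by one <answer> block; only whitespace may
# appear outside the tags. DOTALL lets the reasoning span several lines.
_REPLY_PATTERN = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def parse_tagged_reply(text: str) -> Optional[TaggedReply]:
    """Return the parsed reply, or None if it breaks the format rules."""
    match = _REPLY_PATTERN.fullmatch(text)
    if match is None:
        return None  # missing tags, wrong order, or text outside the tags
    think, answer = (part.strip() for part in match.groups())
    # A stray tag inside either block means the reply repeated a block,
    # which violates the "only one final answer" rule.
    if any(tag in think or tag in answer for tag in _TAGS):
        return None
    if not answer:
        return None  # an empty answer block is not a usable final answer
    return TaggedReply(think=think, answer=answer)
```

Returning `None` rather than raising lets a caller decide whether to re-prompt the model or count the reply as incorrect; that choice is an assumption of this sketch and is not specified in the prompt text.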
