CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

arxiv: 2605.20165 · v1 · pith:M4SC5G5D · submitted 2026-05-19 · 💻 cs.CV

Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jianxu Shangguan, Cheng-Yen Yang, Jenq-Neng Hwang

Pith reviewed 2026-05-20 05:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · spatial understanding · camera motion · evaluation framework · spatial narratives · 3D spatial intelligence

The pith

Vision-language models that score well on spatial questions still lack understanding of camera motion until trained to produce explicit scene and motion narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that strong results on spatial question answering benchmarks do not prove genuine 3D spatial intelligence in vision-language models, since these models routinely overlook how the camera is moving through a scene. To expose the gap, the authors create the Spatial Narrative Score, which asks a model to first write out a full description of the scene contents and the camera's path before a separate frozen language model uses that description to answer questions. Leading models suffer large drops under this requirement even when they handle direct questions accurately. The authors then train CaMo, a vision-language model explicitly grounded in camera motion, and show it maintains steady performance on both the new narrative test and standard spatial questions.

Core claim

State-of-the-art spatial vision-language models exhibit significant performance degradation under the Spatial Narrative Score despite high direct question answering accuracy. CaMo, a camera motion grounded VLM, achieves consistent performance across SNS evaluation and direct spatial question answering accuracy, showing that explicit spatial narrative externalization supports transferable 3D spatial understanding.

What carries the argument

The Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM.
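The two evaluation paths described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Sample`, `direct_accuracy`, `spatial_narrative_score`, and the toy callables are hypothetical names, and the sketch assumes the VLM writes its narrative without seeing the question, which the abstract does not confirm.

```python
from typing import Callable, NamedTuple, Sequence


class Sample(NamedTuple):
    video: str      # hypothetical handle to a video segment
    question: str   # multiple-choice question text
    answer: str     # ground-truth option label, e.g. "A"


def direct_accuracy(samples: Sequence[Sample],
                    vlm_answer: Callable[[str, str], str]) -> float:
    """Direct QA: the VLM sees the video and the question together."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(vlm_answer(s.video, s.question) == s.answer for s in samples)
    return hits / len(samples)


def spatial_narrative_score(samples: Sequence[Sample],
                            vlm_narrate: Callable[[str], str],
                            proxy_answer: Callable[[str, str], str]) -> float:
    """SNS-style scoring: the VLM only writes a narrative; a frozen,
    text-only proxy answers the question from that narrative alone."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = 0
    for s in samples:
        narrative = vlm_narrate(s.video)  # assumption: question is not shown here
        if proxy_answer(narrative, s.question) == s.answer:
            hits += 1
    return hits / len(samples)


# Toy stand-ins for real models, for illustration only.
SAMPLES = [
    Sample("v1", "Which way did the camera turn?", "A"),
    Sample("v2", "Which way did the camera turn?", "B"),
]


def toy_vlm_answer(video: str, question: str) -> str:
    return {"v1": "A", "v2": "B"}[video]


def toy_vlm_narrate(video: str) -> str:
    # The second narrative omits the direction of motion.
    return {"v1": "The camera pans left past a sofa.",
            "v2": "The camera moves through a kitchen."}[video]


def toy_proxy_answer(narrative: str, question: str) -> str:
    if "left" in narrative:
        return "A"
    if "right" in narrative:
        return "B"
    return "unknown"


direct = direct_accuracy(SAMPLES, toy_vlm_answer)
sns = spatial_narrative_score(SAMPLES, toy_vlm_narrate, toy_proxy_answer)
print(f"direct={direct:.2f} sns={sns:.2f} gap={direct - sns:.2f}")
```

In this toy setup the model answers both questions directly but loses one under narrative scoring because its second narrative drops the camera direction, which is the kind of gap the framework is meant to expose.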

If this is right

Direct spatial question answering accuracy alone is insufficient evidence of genuine 3D spatial intelligence in VLMs.
Explicit training on camera motion produces more consistent results across different ways of probing spatial understanding.
Externalizing spatial narratives makes it possible to separate and measure the quality of a model's internal scene representation.
Camera-motion grounding can be added to existing VLM training pipelines without harming performance on conventional spatial QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same narrative-externalization approach could be adapted to evaluate understanding of object dynamics or multi-view consistency in video sequences.
If the consistency gains hold, CaMo-style training might improve downstream performance in robotics tasks that require predicting how scenes change under known camera paths.
Using a frozen proxy LLM after narrative generation may help isolate whether gains come from better visual grounding rather than improved language reasoning.

Load-bearing premise

That forcing a model to output explicit spatial narratives including camera motion and then routing those narratives to a separate frozen language model for reasoning truly measures transferable 3D spatial understanding rather than just the ability to produce good descriptions.

What would settle it

A test in which CaMo-trained models show no advantage over baseline models on spatial reasoning tasks that provide no camera-motion information and require no narrative generation.

Figures

Figures reproduced from arXiv: 2605.20165 by Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang, Jianxu Shangguan, Junbin Lu, Kuang-Ming Chen.

**Figure 1.** Fine-tuning on spatial QA data improves the VLM's (Li et al., 2025b) spatial question answering, but does not translate to better camera motion understanding compared to the base model.

**Figure 2.** Left. CaMo-30K features detailed video semantic and camera motion captions in a structured format. Right. Our data comprises a mixture of image, multi-view, and video spatial understanding QA pairs and our spatial narrative data, with a total of 30K samples.

**Figure 3.** Pipeline of Spatial Narrative Score Evaluation. The input video segment is sent to the VLM to generate a dense video spatial narrative, which is sent to a proxy LLM to generate a prediction for accuracy evaluation.

**Figure 4.** Gap Between Direct MCQ Accuracy and SNS. SpatialLadder exhibits a substantial performance drop across all question types under SNS evaluation, whereas CaMo maintains more consistent or even improved performance in all question types.

**Figure 5.** Qualitative Results from VSI-Bench. Compared with the coarse camera motion (underlined) from SpatialLadder, CaMo generates more fine-grained camera motion (in bold), which contains richer spatial information for the SNS evaluation.

**Figure 6.** Video example from CaMo-30K.

**Figure 7.** Video example from CaMo-30K.

**Figure 8.** Video example from CaMo-30K.

**Figure 9.** Word cloud of CaMo-3B.

**Figure 12.** Spatial narrative video sample from CaMo.

**Figure 13.** Spatial narrative video sample from CaMo.

**Figure 14.** Spatial narrative video sample from CaMo.

**Figure 15.** Spatial narrative video sample from CaMo.

**Figure 16.** Failure case video due to VLM's limitation in generating spatial narrative.

**Figure 17.** Failure case video with ambiguous and inaccurate ground truth.

Original abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SNS shows VLMs can ace direct spatial QA while struggling when forced to externalize camera motion in narratives first, and CaMo trains for consistency, but the proxy-based scoring risks measuring narrative fit more than intrinsic 3D grasp.

The main thing here is that standard spatial QA benchmarks may overestimate how well VLMs track camera motion, and this paper uses an explicit narrative step plus a frozen proxy LLM to surface the gap before proposing CaMo as a training fix that keeps scores steady across both paths. The new contribution is the Spatial Narrative Score setup itself, which requires the VLM to output scene semantics and camera movement descriptions that the proxy then reasons over. They report clear drops for existing models under SNS despite strong direct QA results, and CaMo closes that gap while releasing code, data, and the model. That release is practical for anyone wanting to test or extend the idea. The work does a decent job highlighting why externalizing spatial relations could matter for transferable understanding in areas like robotics or AR.

The softer part is the load-bearing assumption that proxy reasoning from the narrative accurately reflects the VLM's internal spatial model. Degradation could stem from the VLM producing narratives the proxy parses poorly even when the underlying 3D tracking is intact, and CaMo's gains might come from aligning output style to this specific proxy rather than building general motion understanding. The abstract does not detail ablations with alternate proxies, independent narrative accuracy checks, or error breakdowns that would separate those possibilities.

This is aimed at researchers building or benchmarking spatial VLMs who want a stricter test than direct QA alone. A reader focused on evaluation methods or 3D-aware multimodal work would find usable ideas here. It deserves peer review because the core framing around camera motion grounding is worth testing and tightening, even with the current measurement questions.

Referee Report

1 major / 1 minor

Summary. The paper claims that state-of-the-art vision-language models achieve high accuracy on direct spatial question answering but exhibit significant performance degradation under the proposed Spatial Narrative Score (SNS) evaluation framework. SNS requires VLMs to generate explicit spatial narratives capturing scene semantics and camera motion, which are then reasoned over by a frozen proxy LLM. The authors introduce CaMo, a camera motion grounded VLM, which achieves consistent performance across SNS and direct QA, arguing this demonstrates the importance of explicit spatial narrative externalization for transferable 3D spatial understanding.

Significance. If SNS is shown to isolate genuine camera motion understanding rather than narrative generation quality, the work would be significant for VLM evaluation and training in computer vision. The public availability of code, data, and model is a clear strength supporting reproducibility.

major comments (1)

[Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.

minor comments (1)

[Methods] Clarify the exact architecture and training details of CaMo in the methods section to allow replication of the camera motion grounding component.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the SNS evaluation framework below and will revise the manuscript accordingly to strengthen the claims.

Referee: [Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.

Authors: We agree that an explicit ablation isolating narrative generation quality from the proxy's downstream reasoning would strengthen the interpretation and directly address potential confounds. The SNS framework is designed to test whether VLMs can externalize spatial understanding (including camera motion) into coherent narratives that support subsequent reasoning by a separate module; the observed degradation relative to direct QA is intended to highlight limitations in this externalization capability rather than in the proxy itself. We selected a strong, frozen general-purpose LLM as the proxy precisely to reduce parsing failures and focus on the quality of the VLM-generated narratives (details in Section 4). To resolve the concern, we will add an ablation study in the revised manuscript that independently evaluates narrative quality (via human ratings and alternative reasoning models) and compares SNS performance under controlled conditions. This will clarify the source of the performance gap and better support the interpretation of CaMo's consistent results across both evaluation modes as evidence for improved camera-motion-grounded externalization.

Revision: yes
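One way to picture the "alternative reasoning models" part of such an ablation is to score the same narratives under several proxies and check whether the ordering of two VLMs survives the swap. The sketch below is illustrative only and is not taken from the paper: `sns_by_proxy`, `ranking_is_stable`, and all toy models are hypothetical names, and samples are assumed to be `(video, question, answer)` tuples.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[str, str, str]          # (video handle, question, answer label)
Proxy = Callable[[str, str], str]      # (narrative, question) -> predicted label


def sns_by_proxy(samples: Sequence[Sample],
                 narrate: Callable[[str], str],
                 proxies: Dict[str, Proxy]) -> Dict[str, float]:
    """Score one VLM's narratives under every proxy; returns {proxy: accuracy}."""
    if not samples:
        raise ValueError("need at least one sample")
    # Generate each narrative once so all proxies read identical text.
    narratives: List[str] = [narrate(video) for video, _, _ in samples]
    scores = {}
    for name, proxy in proxies.items():
        hits = sum(proxy(n, q) == a
                   for n, (_, q, a) in zip(narratives, samples))
        scores[name] = hits / len(samples)
    return scores


def ranking_is_stable(scores_a: Dict[str, float],
                      scores_b: Dict[str, float]) -> bool:
    """True if one model is at least as good as the other under every proxy."""
    diffs = [scores_a[p] - scores_b[p] for p in scores_a]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)


# Toy stand-ins, for illustration only.
SAMPLES = [("v1", "Which way did the camera turn?", "A"),
           ("v2", "Which way did the camera turn?", "B")]


def narrate_fine(video: str) -> str:
    return {"v1": "camera pans left", "v2": "camera pans right"}[video]


def narrate_coarse(video: str) -> str:
    return {"v1": "camera moves", "v2": "camera pans right"}[video]


def strict_proxy(narrative: str, question: str) -> str:
    if "left" in narrative:
        return "A"
    if "right" in narrative:
        return "B"
    return "unknown"


def lenient_proxy(narrative: str, question: str) -> str:
    # Guesses "A" whenever the narrative is vague about direction.
    return "A" if ("left" in narrative or "moves" in narrative) else "B"


PROXIES = {"strict": strict_proxy, "lenient": lenient_proxy}
fine_scores = sns_by_proxy(SAMPLES, narrate_fine, PROXIES)
coarse_scores = sns_by_proxy(SAMPLES, narrate_coarse, PROXIES)
print(fine_scores, coarse_scores, ranking_is_stable(fine_scores, coarse_scores))
```

If the gap between two models appears under one proxy and vanishes under another, as it does for the lenient toy proxy here, that would suggest the score is sensitive to proxy behavior and not only to narrative quality.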

Circularity Check

0 steps flagged

No circularity: SNS evaluation and CaMo training rest on empirical comparison to direct QA

The paper defines SNS as an external evaluation procedure that elicits spatial narratives from the target VLM and routes them through a separate frozen proxy LLM for reasoning; performance degradation is then reported as an empirical observation against direct QA baselines. CaMo is introduced as a training method that improves consistency on both metrics. No equations, fitted parameters, or self-citations are shown to reduce the central claims to definitional equivalence or to the inputs by construction; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; details of the training procedure, hyperparameters, and exact implementation are not available. Relies on standard assumptions about VLM fine-tuning and proxy LLM reliability for scoring.

axioms (1)

domain assumption: A frozen proxy LLM can reliably evaluate the quality of spatial narratives generated by the target VLM.
Central to the SNS scoring process described in the abstract.

pith-pipeline@v0.9.0 · 5710 in / 1168 out tokens · 48213 ms · 2026-05-20T05:23:16.688853+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "CaMo-3B ... achieves consistent performance across SNS evaluation and direct spatial question answering accuracy."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 32 internal anchors

[1] Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
[2] ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data. arXiv preprint arXiv:2111.08897.
[3] Omni3D: A large benchmark and model for 3D object detection in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[4] HourVideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems.
[5] Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models. 2025.
[6] Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271.
[7] ScanNet: Richly-annotated 3D reconstructions of indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[8] GRIT: Teaching MLLMs to Think with Images. arXiv preprint arXiv:2505.15879.
[9] Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv preprint arXiv:2503.21776.
[10] 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems.
[11] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749.
[12] GPT-4o System Card. arXiv preprint arXiv:2410.21276.
[13] Mixtral of Experts. arXiv preprint arXiv:2401.04088.
[14] What's "up" with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785.
[15] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv preprint arXiv:2302.09664.
[16] Kullback-Leibler divergence.
[17] LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research.
[18] ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models. arXiv preprint arXiv:2505.21500.
[19] LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv preprint arXiv:2407.07895.
[20] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning. arXiv preprint arXiv:2504.06958.
[21] STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765.
[22] Self-Rewarding Vision-Language Model via Reasoning Decomposition. arXiv preprint arXiv:2508.19652.
[23] Improved visual-spatial reasoning via R1-Zero-like training. arXiv preprint arXiv:2504.00883.
[24] Microsoft COCO: Common objects in context. European Conference on Computer Vision, 2014.
[25] Coarse correspondences boost spatial-temporal reasoning in multimodal language model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[26] Visual-RFT: Visual Reinforcement Fine-Tuning. arXiv preprint arXiv:2503.01785.
[27] MM-Eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR.
[28] SpaceR: Reinforcing MLLMs in Video Spatial Reasoning. arXiv preprint arXiv:2504.01805.
[29] Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
[30] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models. arXiv preprint arXiv:2511.23075.
[31] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction. arXiv preprint arXiv:2505.20279.
[32] VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv preprint arXiv:2504.07615.
[33] Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv preprint arXiv:2505.15966.
[34] Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers. arXiv preprint arXiv:2506.23918.
[35] Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
[36] Kimi-VL Technical Report. arXiv preprint arXiv:2504.07491.
[37] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. arXiv preprint arXiv:2402.12289.
[38] Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems.
[39] VGGT: Visual geometry grounded transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[40] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[41] Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence. arXiv preprint arXiv:2505.23747.
[42] SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence (cited as "SpatialScore: Towards unified evaluation for multimodal spatial understanding"). arXiv preprint arXiv:2505.17012.
[43] Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing. arXiv preprint arXiv:2506.09965.
[44] Thinking in space: How multimodal large language models see, remember, and recall spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[45] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. arXiv preprint arXiv:2503.10615.
[46] Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634.
[47] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
[48] InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. arXiv preprint arXiv:2401.16420.
[49] InternVideo2: Scaling foundation models for multimodal video understanding. European Conference on Computer Vision, 2024.
[50] ScanNet++: A high-fidelity dataset of 3D indoor scenes. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[51] Perception-R1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954.
[52] From flatland to space: Teaching vision-language models to perceive and reason in 3D. arXiv preprint arXiv:2503.22976.
[53] Video-3D LLM: Learning position-aware video representation for 3D scene understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[54] Scene parsing through ADE20K dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[55] LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness. arXiv preprint arXiv:2409.18125.
[56] LLaVA-Video: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713.
[57] LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations).
[58] Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
[59] Towards Understanding Camera Motions in Any Video. arXiv preprint arXiv:2504.15376.
[60] SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531.
[61] RT-2: Vision-language-action models transfer web knowledge to robotic control. Conference on Robot Learning, 2023.
[62] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark.
[63] DocVLM: Make your VLM an efficient reader. Proceedings of the Computer Vision and Pattern Recognition Conference.
[64] SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[65] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 2025.
[66] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[67]

Advances in Neural Information Processing Systems , volume=

Spatialrgpt: Grounded spatial reasoning in vision-language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[68]

DeepSeek-OCR: Contexts Optical Compression

Deepseek-ocr: Contexts optical compression , author=. arXiv preprint arXiv:2510.18234 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[69]

Advances in Neural Information Processing Systems , volume=

Motionbooth: Motion-aware customized text-to-video generation , author=. Advances in Neural Information Processing Systems , volume=

work page
[70]

3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding.arXiv preprint arXiv:2507.23478, 2025

3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding , author=. arXiv preprint arXiv:2507.23478 , year=

work page arXiv
[71]

arXiv preprint arXiv:2506.17545 , year=

Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations , author=. arXiv preprint arXiv:2506.17545 , year=

work page arXiv
[72]

ACM SIGGRAPH 2024 Conference Papers , pages=

Direct-a-video: Customized video generation with user-directed camera movement and object motion , author=. ACM SIGGRAPH 2024 Conference Papers , pages=

work page 2024
[73]

Train on the Test Set

Benchmark Designers Should" Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts , author=. arXiv preprint arXiv:2511.04655 , year=

work page arXiv
[74]

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=

Learning visual grounding from generative vision and language model , author=. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , pages=. 2025 , organization=

work page 2025
[75]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Streaming dense video captioning , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

work page
[76]

Advances in Neural Information Processing Systems , volume=

Sharegpt4video: Improving video understanding and generation with better captions , author=. Advances in Neural Information Processing Systems , volume=

work page
[77]

arXiv preprint arXiv:2504.16072 , year=

Describe anything: Detailed localized image and video captioning , author=. arXiv preprint arXiv:2504.16072 , year=

work page arXiv
[78]

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

mplug-owl3: Towards long image-sequence understanding in multi-modal large language models , author=. arXiv preprint arXiv:2408.04840 , year=

work page internal anchor Pith review arXiv
[79]

openai.com/index/gpt-5-system-card , year=

GPT-5 System Card , author=. openai.com/index/gpt-5-system-card , year=

work page
[80]

arXiv preprint arXiv:2511.19436 , year=

VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection , author=. arXiv preprint arXiv:2511.19436 , year=

work page arXiv
