arxiv: 2605.20165 · v1 · pith:M4SC5G5Dnew · submitted 2026-05-19 · 💻 cs.CV

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

Pith reviewed 2026-05-20 05:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision-language models · spatial understanding · camera motion · evaluation framework · spatial narratives · 3D spatial intelligence

The pith

Vision-language models that score well on spatial questions still lack understanding of camera motion until trained to produce explicit scene and motion narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that strong results on spatial question answering benchmarks do not prove genuine 3D spatial intelligence in vision-language models, since these models routinely overlook how the camera is moving through a scene. To expose the gap, the authors create the Spatial Narrative Score, which asks a model to first write out a full description of the scene contents and the camera's path before a separate frozen language model uses that description to answer questions. Leading models suffer large drops under this requirement even when they handle direct questions accurately. The authors then train CaMo, a vision-language model explicitly grounded in camera motion, and show it maintains steady performance on both the new narrative test and standard spatial questions.

Core claim

State-of-the-art spatial vision-language models exhibit significant performance degradation under the Spatial Narrative Score despite high direct question answering accuracy. CaMo, a camera motion grounded VLM, achieves consistent performance across SNS evaluation and direct spatial question answering accuracy, showing that explicit spatial narrative externalization supports transferable 3D spatial understanding.

What carries the argument

The Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM.
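
The two-stage procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `SpatialQuestion`, `spatial_narrative_score`, and the callables `generate_narrative` and `proxy_answer` are hypothetical names, the questions are assumed to be multiple choice, and the keyword-matching `toy_proxy` is only a stand-in for a real frozen LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class SpatialQuestion:
    """One multiple-choice question about a video (hypothetical schema)."""
    video_id: str
    question: str
    options: Sequence[str]
    answer: int  # index of the correct option


def spatial_narrative_score(
    questions: Sequence[SpatialQuestion],
    generate_narrative: Callable[[str], str],
    proxy_answer: Callable[[str, str, Sequence[str]], int],
) -> float:
    """Accuracy of a proxy that sees only the VLM's narrative, never the video."""
    if not questions:
        raise ValueError("need at least one question")
    narratives: Dict[str, str] = {}
    correct = 0
    for q in questions:
        # Stage 1: the VLM under test writes one narrative per video.
        if q.video_id not in narratives:
            narratives[q.video_id] = generate_narrative(q.video_id)
        # Stage 2: the frozen proxy answers from the narrative text alone.
        prediction = proxy_answer(narratives[q.video_id], q.question, q.options)
        correct += int(prediction == q.answer)
    return correct / len(questions)


def toy_proxy(narrative: str, question: str, options: Sequence[str]) -> int:
    """Stand-in for a frozen LLM: pick the first option the narrative mentions."""
    for index, option in enumerate(options):
        if option.lower() in narrative.lower():
            return index
    return 0  # fall back to the first option when nothing matches


questions = [
    SpatialQuestion("v1", "Which way does the camera pan?", ("right", "left"), 1)
]


def detailed_vlm(video_id: str) -> str:
    return "A sofa sits by the window. The camera pans left toward the door."


def coarse_vlm(video_id: str) -> str:
    return "A sofa sits by the window. The camera moves."


detailed_score = spatial_narrative_score(questions, detailed_vlm, toy_proxy)
coarse_score = spatial_narrative_score(questions, coarse_vlm, toy_proxy)
```

In this toy setup the narrative that names the pan direction lets the proxy answer correctly, while the coarse narrative does not. Because the proxy is held fixed, any difference in score comes from the narrative text it is given.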

If this is right

  • Direct spatial question answering accuracy alone is insufficient evidence of genuine 3D spatial intelligence in VLMs.
  • Explicit training on camera motion produces more consistent results across different ways of probing spatial understanding.
  • Externalizing spatial narratives makes it possible to separate and measure the quality of a model's internal scene representation.
  • Camera-motion grounding can be added to existing VLM training pipelines without harming performance on conventional spatial QA benchmarks.
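
The comparison behind the first and third points can be sketched as a per-question-type gap between direct accuracy and SNS accuracy. The function name `accuracy_gap_by_type`, the record layout, and the question-type labels are assumptions made for illustration; the records below are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, bool, bool]  # (question_type, direct_correct, sns_correct)


def accuracy_gap_by_type(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Per question type: direct accuracy, SNS accuracy, and their difference."""
    counts = defaultdict(lambda: [0, 0, 0])  # total, direct hits, SNS hits
    for question_type, direct_ok, sns_ok in records:
        entry = counts[question_type]
        entry[0] += 1
        entry[1] += int(direct_ok)
        entry[2] += int(sns_ok)
    return {
        question_type: {
            "direct": direct_hits / total,
            "sns": sns_hits / total,
            # A positive gap means the model answers directly better than
            # its own narrative supports.
            "gap": (direct_hits - sns_hits) / total,
        }
        for question_type, (total, direct_hits, sns_hits) in counts.items()
    }


# Placeholder records for illustration only.
records = [
    ("relative_direction", True, False),
    ("relative_direction", True, False),
    ("relative_direction", True, True),
    ("relative_direction", False, False),
    ("object_count", True, True),
    ("object_count", False, False),
]
gaps = accuracy_gap_by_type(records)
```

A large positive gap for a question type would be consistent with the shortcut behavior the paper describes. A gap near zero would suggest that the narrative carries the information the model uses when answering directly.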

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same narrative-externalization approach could be adapted to evaluate understanding of object dynamics or multi-view consistency in video sequences.
  • If the consistency gains hold, CaMo-style training might improve downstream performance in robotics tasks that require predicting how scenes change under known camera paths.
  • Using a frozen proxy LLM after narrative generation may help isolate whether gains come from better visual grounding rather than improved language reasoning.

Load-bearing premise

The argument assumes that forcing a model to output explicit spatial narratives, camera motion included, and then routing those narratives to a separate frozen language model for reasoning truly measures transferable 3D spatial understanding rather than just the ability to produce good descriptions.

What would settle it

A test in which CaMo-trained models show no advantage over baseline models on spatial reasoning tasks that provide no camera-motion information and require no narrative generation.

Figures

Figures reproduced from arXiv: 2605.20165 by Cheng-Yen Yang, Hsiang-Wei Huang, Jenq-Neng Hwang, Jianxu Shangguan, Junbin Lu, Kuang-Ming Chen.

Figure 1: Fine-tuning on spatial QA data improves VLM’s (Li et al., 2025b) spatial question answering, but does not translate to better camera motion understanding compared to base model.
Figure 2: Left. CaMo-30K features detailed video semantic and camera motion caption in a structured format. Right. Our data composes a mixture of image, multi-view, and video spatial understanding QA pairs and our spatial narrative data, with a total of 30K samples.
Figure 3: Pipeline of Spatial Narrative Score Evaluation. The input video segment will be sent to the VLM to generate dense video spatial narrative, which is sent to a proxy LLM to generate prediction for accuracy evaluation.
Figure 4: Gap Between Direct MCQ Accuracy and SNS. SpatialLadder exhibits a substantial performance drop across all question types under SNS evaluation, whereas CaMo maintains more consistent or even improved performance in all question types.
Figure 5: Qualitative Results from VSI-Bench. Compared with the coarse camera motion (highlighted underline) from SpatialLadder, CaMo generates more fine-grained camera motion (highlighted bold), which contains richer spatial information for the SNS evaluation.
Figure 6: Video example from CaMo-30K.
Figure 7: Video example from CaMo-30K.
Figure 8: Video example from CaMo-30K.
Figure 9: Word cloud of CaMo-3B.
Figure 12: Spatial narrative video sample from CaMo.
Figure 13: Spatial narrative video sample from CaMo.
Figure 14: Spatial narrative video sample from CaMo.
Figure 15: Spatial narrative video sample from CaMo.
Figure 16: Failure case video due to VLM’s limitation in generating spatial narrative.
Figure 17: Failure case video with ambiguous and inaccurate ground truth.
Original abstract

Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that state-of-the-art vision-language models achieve high accuracy on direct spatial question answering but exhibit significant performance degradation under the proposed Spatial Narrative Score (SNS) evaluation framework. SNS requires VLMs to generate explicit spatial narratives capturing scene semantics and camera motion, which are then reasoned over by a frozen proxy LLM. The authors introduce CaMo, a camera motion grounded VLM, which achieves consistent performance across SNS and direct QA, arguing this demonstrates the importance of explicit spatial narrative externalization for transferable 3D spatial understanding.

Significance. If SNS is shown to isolate genuine camera motion understanding rather than narrative generation quality, the work would be significant for VLM evaluation and training in computer vision. The public availability of code, data, and model is a clear strength supporting reproducibility.

major comments (1)
  1. [Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.
minor comments (1)
  1. [Methods] Clarify the exact architecture and training details of CaMo in the methods section to allow replication of the camera motion grounding component.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the SNS evaluation framework below and will revise the manuscript accordingly to strengthen the claims.

Point-by-point responses
  1. Referee: [Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.

    Authors: We agree that an explicit ablation isolating narrative generation quality from the proxy's downstream reasoning would strengthen the interpretation and directly address potential confounds. The SNS framework is designed to test whether VLMs can externalize spatial understanding (including camera motion) into coherent narratives that support subsequent reasoning by a separate module; the observed degradation relative to direct QA is intended to highlight limitations in this externalization capability rather than in the proxy itself. We selected a strong, frozen general-purpose LLM as the proxy precisely to reduce parsing failures and focus on the quality of the VLM-generated narratives (details in Section 4). To resolve the concern, we will add an ablation study in the revised manuscript that independently evaluates narrative quality (via human ratings and alternative reasoning models) and compares SNS performance under controlled conditions. This will clarify the source of the performance gap and better support the interpretation of CaMo's consistent results across both evaluation modes as evidence for improved camera-motion-grounded externalization.

    revision: yes
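
One way to picture the kind of ablation discussed in this exchange is a grid that scores every narrative source against every proxy. The sketch below is a hypothetical design, not something the paper or the rebuttal specifies: `ablation_grid`, the "reference" narrator (standing in for a human-written narrative), and both toy proxies are assumptions for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Question = Tuple[str, str, Sequence[str], int]  # (video_id, question, options, answer)
Narrator = Callable[[str], str]
Proxy = Callable[[str, str, Sequence[str]], int]


def ablation_grid(
    questions: List[Question],
    narrators: Dict[str, Narrator],
    proxies: Dict[str, Proxy],
) -> Dict[Tuple[str, str], float]:
    """Accuracy for every (narrative source, proxy) pair."""
    if not questions:
        raise ValueError("need at least one question")
    grid = {}
    for (narrator_name, narrate), (proxy_name, answer) in product(
        narrators.items(), proxies.items()
    ):
        hits = sum(
            answer(narrate(video_id), question, options) == gold
            for video_id, question, options, gold in questions
        )
        grid[(narrator_name, proxy_name)] = hits / len(questions)
    return grid


def keyword_proxy(narrative: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a capable proxy: first option mentioned in the narrative."""
    for index, option in enumerate(options):
        if option.lower() in narrative.lower():
            return index
    return 0


def always_first_proxy(narrative: str, question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a proxy that ignores the narrative entirely."""
    return 0


questions = [("v1", "Which way does the camera pan?", ("right", "left"), 1)]
narrators = {
    "model": lambda video_id: "The camera moves past a sofa.",
    "reference": lambda video_id: "The camera pans left past a sofa.",
}
proxies = {"keyword": keyword_proxy, "always_first": always_first_proxy}
grid = ablation_grid(questions, narrators, proxies)
```

Reading the grid: if the reference narrative scores well under a proxy while the model narrative does not, the gap points at narrative quality. If both narratives score poorly under the same proxy, that proxy is the likelier bottleneck.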

Circularity Check

0 steps flagged

No circularity: SNS evaluation and CaMo training rest on empirical comparison to direct QA

full rationale

The paper defines SNS as an external evaluation procedure that elicits spatial narratives from the target VLM and routes them through a separate frozen proxy LLM for reasoning; performance degradation is then reported as an empirical observation against direct QA baselines. CaMo is introduced as a training method that improves consistency on both metrics. No equations, fitted parameters, or self-citations are shown to reduce the central claims to definitional equivalence or to the inputs by construction; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract; details of the training procedure, hyperparameters, and exact implementation are not available. Relies on standard assumptions about VLM fine-tuning and proxy LLM reliability for scoring.

axioms (1)
  • domain assumption: A frozen proxy LLM can reliably evaluate the quality of spatial narratives generated by the target VLM.
    Central to the SNS scoring process described in the abstract.

pith-pipeline@v0.9.0 · 5710 in / 1168 out tokens · 48213 ms · 2026-05-20T05:23:16.688853+00:00 · methodology

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 32 internal anchors

  1. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923.
  2. ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data. arXiv preprint arXiv:2111.08897.
  3. Omni3D: A large benchmark and model for 3D object detection in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  4. HourVideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems.
  5. Perception before reasoning: Two-stage reinforcement learning for visual reasoning in vision-language models. 2025.
  6. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271.
  7. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  8. GRIT: Teaching MLLMs to Think with Images. arXiv preprint arXiv:2505.15879.
  9. Video-R1: Reinforcing Video Reasoning in MLLMs. arXiv preprint arXiv:2503.21776.
  10. 3D-LLM: Injecting the 3D world into large language models. Advances in Neural Information Processing Systems.
  11. Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. arXiv preprint arXiv:2503.06749.
  12. GPT-4o System Card. arXiv preprint arXiv:2410.21276.
  13. Mixtral of Experts. arXiv preprint arXiv:2401.04088.
  14. What's "up" with vision-language models? Investigating their struggle with spatial reasoning. arXiv preprint arXiv:2310.19785.
  15. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv preprint arXiv:2302.09664.
  16. Kullback-Leibler divergence.
  17. LLaVA-OneVision: Easy Visual Task Transfer. Transactions on Machine Learning Research.
  18. ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models. arXiv preprint arXiv:2505.21500.
  19. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv preprint arXiv:2407.07895.
  20. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning. arXiv preprint arXiv:2504.06958.
  21. STI-Bench: Are MLLMs ready for precise spatial-temporal world understanding? arXiv preprint arXiv:2503.23765.
  22. Self-Rewarding Vision-Language Model via Reasoning Decomposition. arXiv preprint arXiv:2508.19652.
  23. Improved visual-spatial reasoning via R1-Zero-like training. arXiv preprint arXiv:2504.00883.
  24. Microsoft COCO: Common objects in context. European Conference on Computer Vision, 2014.
  25. Coarse correspondences boost spatial-temporal reasoning in multimodal language model. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  26. Visual-RFT: Visual Reinforcement Fine-Tuning. arXiv preprint arXiv:2503.01785.
  27. MM-Eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR.
  28. SpaceR: Reinforcing MLLMs in Video Spatial Reasoning. arXiv preprint arXiv:2504.01805.
  29. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  30. SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models. arXiv preprint arXiv:2511.23075.
  31. VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction. arXiv preprint arXiv:2505.20279.
  32. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv preprint arXiv:2504.07615.
  33. Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning. arXiv preprint arXiv:2505.15966.
  34. Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers. arXiv preprint arXiv:2506.23918.
  35. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  36. Kimi-VL Technical Report. arXiv preprint arXiv:2504.07491.
  37. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. arXiv preprint arXiv:2402.12289.
  38. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. Advances in Neural Information Processing Systems.
  39. VGGT: Visual geometry grounded transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  40. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  41. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence. arXiv preprint arXiv:2505.23747.
  42. SpatialScore: Towards unified evaluation for multimodal spatial understanding (canonical title: SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence). arXiv preprint arXiv:2505.17012.
  43. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing. arXiv preprint arXiv:2506.09965.
  44. Thinking in space: How multimodal large language models see, remember, and recall spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  45. R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization. arXiv preprint arXiv:2503.10615.
  46. Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634.
  47. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models. arXiv preprint arXiv:2504.10479.
  48. InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model. arXiv preprint arXiv:2401.16420.
  49. InternVideo2: Scaling foundation models for multimodal video understanding. European Conference on Computer Vision, 2024.
  50. ScanNet++: A high-fidelity dataset of 3D indoor scenes. Proceedings of the IEEE/CVF International Conference on Computer Vision.
  51. Perception-R1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954.
  52. From flatland to space: Teaching vision-language models to perceive and reason in 3D. arXiv preprint arXiv:2503.22976.
  53. Video-3D LLM: Learning position-aware video representation for 3D scene understanding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  54. Scene parsing through ADE20K dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  55. LLaVA-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness. arXiv preprint arXiv:2409.18125.
  56. Video instruction tuning with synthetic data (canonical title: LLaVA-Video: Video Instruction Tuning With Synthetic Data). arXiv preprint arXiv:2410.02713.
  57. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations).
  58. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
  59. Towards Understanding Camera Motions in Any Video. arXiv preprint arXiv:2504.15376.
  60. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv preprint arXiv:2510.08531.
  61. RT-2: Vision-language-action models transfer web knowledge to robotic control. Conference on Robot Learning, 2023.
  62. AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark.
  63. DocVLM: Make your VLM an efficient reader. Proceedings of the Computer Vision and Pattern Recognition Conference.
  64. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  65. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 2025.
  66. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  67. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems.
  68. DeepSeek-OCR: Contexts Optical Compression. arXiv preprint arXiv:2510.18234.
  69. MotionBooth: Motion-aware customized text-to-video generation. Advances in Neural Information Processing Systems.
  70. 3D-R1: Enhancing reasoning in 3D VLMs for unified scene understanding. arXiv preprint arXiv:2507.23478.
  71. Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations. arXiv preprint arXiv:2506.17545.
  72. Direct-a-Video: Customized video generation with user-directed camera movement and object motion. ACM SIGGRAPH 2024 Conference Papers.
  73. Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts. arXiv preprint arXiv:2511.04655.
  74. Learning visual grounding from generative vision and language model. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
  75. Streaming dense video captioning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  76. ShareGPT4Video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems.
  77. Describe anything: Detailed localized image and video captioning. arXiv preprint arXiv:2504.16072.
  78. mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models. arXiv preprint arXiv:2408.04840.
  79. GPT-5 System Card. openai.com/index/gpt-5-system-card.
  80. VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection. arXiv preprint arXiv:2511.19436.

Showing first 80 references.