CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
Pith reviewed 2026-05-20 05:23 UTC · model grok-4.3
pith:M4SC5G5D
The pith
Vision-language models that score well on spatial questions still lack understanding of camera motion until trained to produce explicit scene and motion narratives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art spatial vision-language models exhibit significant performance degradation under the Spatial Narrative Score despite high direct question answering accuracy. CaMo, a camera motion grounded VLM, achieves consistent performance across SNS evaluation and direct spatial question answering accuracy, showing that explicit spatial narrative externalization supports transferable 3D spatial understanding.
What carries the argument
The Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM.
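The two-stage routing described here can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `narrate` and `proxy_answer` callables, the exact-match scoring rule, and the toy stand-ins are all assumptions introduced for the example.

```python
from typing import Callable, Sequence

# Hypothetical interfaces (assumptions, not taken from the paper):
#   narrate(video_id)                 -> narrative text written by the VLM under test
#   proxy_answer(narrative, question) -> answer from the frozen, text-only proxy LLM
Narrator = Callable[[str], str]
ProxyReasoner = Callable[[str, str], str]


def narrative_routed_accuracy(
    items: Sequence[tuple[str, str, str]],  # (video_id, question, gold_answer)
    narrate: Narrator,
    proxy_answer: ProxyReasoner,
) -> float:
    """Fraction of questions the proxy answers correctly from narratives alone."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for video_id, question, gold in items:
        narrative = narrate(video_id)                  # stage 1: only the VLM sees the video
        predicted = proxy_answer(narrative, question)  # stage 2: the proxy sees text only
        # Exact match after normalisation is an assumed scoring rule.
        correct += predicted.strip().lower() == gold.strip().lower()
    return correct / len(items)


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def toy_narrate(video_id: str) -> str:
    return {
        "v1": "The camera pans left past a sofa.",
        "v2": "A table is in view.",
    }[video_id]


def toy_proxy(narrative: str, question: str) -> str:
    return "left" if "pans left" in narrative else "unknown"


items = [
    ("v1", "Which way does the camera pan?", "left"),
    ("v2", "Which way does the camera pan?", "right"),
]
score = narrative_routed_accuracy(items, toy_narrate, toy_proxy)
```

Because the proxy never sees pixels, any information it uses must have been written into the narrative, which is the property the framework relies on.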
If this is right
- Direct spatial question answering accuracy alone is insufficient evidence of genuine 3D spatial intelligence in VLMs.
- Explicit training on camera motion produces more consistent results across different ways of probing spatial understanding.
- Externalizing spatial narratives makes it possible to separate and measure the quality of a model's internal scene representation.
- Camera-motion grounding can be added to existing VLM training pipelines without harming performance on conventional spatial QA benchmarks.
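One simple way to make the "consistency" in these points concrete is to report, per model, the difference between direct QA accuracy and narrative-routed accuracy. The sketch below assumes both scores are accuracies in [0, 1]; the function name and the placeholder numbers are illustrative and are not results from the paper.

```python
def consistency_gaps(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map model name -> (direct QA accuracy) minus (narrative-routed accuracy)."""
    gaps = {}
    for name, (direct_acc, routed_acc) in scores.items():
        for value in (direct_acc, routed_acc):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"accuracy out of range for {name}: {value}")
        gaps[name] = direct_acc - routed_acc
    return gaps


# Placeholder values chosen only to exercise the function, NOT reported results.
placeholder_scores = {
    "model_a": (0.75, 0.25),  # large gap: direct QA looks much better than routed
    "model_b": (0.5, 0.5),    # no gap: the two evaluation modes agree
}
gaps = consistency_gaps(placeholder_scores)
```

A gap near zero is what "consistent performance across both evaluation modes" would look like under this reading; a large positive gap corresponds to the degradation described above.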
Where Pith is reading between the lines
- The same narrative-externalization approach could be adapted to evaluate understanding of object dynamics or multi-view consistency in video sequences.
- If the consistency gains hold, CaMo-style training might improve downstream performance in robotics tasks that require predicting how scenes change under known camera paths.
- Using a frozen proxy LLM after narrative generation may help isolate whether gains come from better visual grounding rather than improved language reasoning.
Load-bearing premise
That forcing a model to output explicit spatial narratives including camera motion and then routing those narratives to a separate frozen language model for reasoning truly measures transferable 3D spatial understanding rather than just the ability to produce good descriptions.
What would settle it
A test in which CaMo-trained models show no advantage over baseline models on spatial reasoning tasks that provide no camera-motion information and require no narrative generation.
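Deciding whether a model shows "no advantage" needs some uncertainty estimate on the accuracy difference. A paired bootstrap over per-question correctness is one common choice; the sketch below is an assumption about how such a comparison could be run, not a procedure described in the paper, and the function name and defaults are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_diff(
    a_correct: Sequence[int],  # 1 if model A answered question i correctly, else 0
    b_correct: Sequence[int],  # same questions, model B
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (observed accuracy difference A - B, ~2.5th and ~97.5th percentiles)."""
    if len(a_correct) != len(b_correct) or not a_correct:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a_correct)
    diffs = []
    for _ in range(n_resamples):
        # Resample question indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a_correct[i] - b_correct[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = (sum(a_correct) - sum(b_correct)) / n
    return observed, lower, upper
```

If the resulting interval comfortably contains zero on a benchmark with no camera-motion cues and no narrative step, that would be the kind of evidence this test calls for.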
Original abstract
Vision-Language Models (VLMs) achieve strong performance on spatial question answering benchmarks, yet it remains unclear whether such gains reflect genuine spatial intelligence. We show that existing spatial VLMs lack basic camera motion understanding, a key component of spatial cognition. We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM. Under SNS, state-of-the-art spatial VLMs exhibit significant performance degradation despite high direct question answering accuracy. To address this gap, we introduce CaMo, a camera motion grounded VLM that achieves consistent performance across SNS evaluation and direct spatial question answering accuracy. Our results highlight the importance of explicit spatial narrative externalization for evaluating VLMs with transferable 3D spatial understanding. Our code, data, and model are available at https://github.com/hsiangwei0903/CaMo
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that state-of-the-art vision-language models achieve high accuracy on direct spatial question answering but exhibit significant performance degradation under the proposed Spatial Narrative Score (SNS) evaluation framework. SNS requires VLMs to generate explicit spatial narratives capturing scene semantics and camera motion, which are then reasoned over by a frozen proxy LLM. The authors introduce CaMo, a camera motion grounded VLM, which achieves consistent performance across SNS and direct QA, arguing this demonstrates the importance of explicit spatial narrative externalization for transferable 3D spatial understanding.
Significance. If SNS is shown to isolate genuine camera motion understanding rather than narrative generation quality, the work would be significant for VLM evaluation and training in computer vision. The public availability of code, data, and model is a clear strength supporting reproducibility.
major comments (1)
- [Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.
minor comments (1)
- [Methods] Clarify the exact architecture and training details of CaMo in the methods section to allow replication of the camera motion grounding component.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment on the SNS evaluation framework below and will revise the manuscript accordingly to strengthen the claims.
Point-by-point responses
- Referee: [Abstract and SNS Evaluation Framework] The headline claim (Abstract) that SNS degradation demonstrates VLMs lack camera motion understanding is load-bearing on the assumption that the proxy LLM reasoning step accurately measures the VLM's internal spatial cognition. Degradation could instead arise solely from poor narrative generation that the proxy cannot parse effectively. No ablation isolating narrative externalization quality from downstream reasoning is described, which directly undermines the contrast with direct QA accuracy and the interpretation of CaMo's consistent scores.
Authors: We agree that an explicit ablation isolating narrative generation quality from the proxy's downstream reasoning would strengthen the interpretation and directly address potential confounds. The SNS framework is designed to test whether VLMs can externalize spatial understanding (including camera motion) into coherent narratives that support subsequent reasoning by a separate module; the observed degradation relative to direct QA is intended to highlight limitations in this externalization capability rather than in the proxy itself. We selected a strong, frozen general-purpose LLM as the proxy precisely to reduce parsing failures and focus on the quality of the VLM-generated narratives (details in Section 4). To resolve the concern, we will add an ablation study in the revised manuscript that independently evaluates narrative quality (via human ratings and alternative reasoning models) and compares SNS performance under controlled conditions. This will clarify the source of the performance gap and better support the interpretation of CaMo's consistent results across both evaluation modes as evidence for improved camera-motion-grounded externalization.
Revision: yes
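An ablation of the kind discussed in this exchange could be organised as a grid over narrative sources and proxy reasoners. The sketch below is one possible layout under stated assumptions: the use of reference (for example human-written) narratives as a ceiling condition is an illustrative choice, not something the authors commit to, and every function and name here is hypothetical.

```python
from typing import Callable, Mapping, Sequence

Narrator = Callable[[str], str]            # video_id -> narrative text
ProxyReasoner = Callable[[str, str], str]  # (narrative, question) -> answer


def ablation_grid(
    items: Sequence[tuple[str, str, str]],  # (video_id, question, gold_answer)
    narrative_sources: Mapping[str, Narrator],
    proxies: Mapping[str, ProxyReasoner],
) -> dict[tuple[str, str], float]:
    """Accuracy for every (narrative source, proxy) pair on the same questions."""
    if not items:
        raise ValueError("items must be non-empty")
    grid = {}
    for source_name, narrate in narrative_sources.items():
        # Generate each narrative once so every proxy reads identical text.
        narratives = {vid: narrate(vid) for vid, _, _ in items}
        for proxy_name, proxy in proxies.items():
            hits = sum(
                proxy(narratives[vid], question).strip().lower() == gold.lower()
                for vid, question, gold in items
            )
            grid[(source_name, proxy_name)] = hits / len(items)
    return grid


# Toy stand-ins (assumptions) so the sketch runs.
reference = {"v1": "camera moves left", "v2": "camera moves right"}
vlm_out = {"v1": "camera moves left", "v2": "a room"}


def keyword_proxy(narrative: str, question: str) -> str:
    for direction in ("left", "right"):
        if direction in narrative:
            return direction
    return "unknown"


def constant_proxy(narrative: str, question: str) -> str:
    return "left"


items = [("v1", "Pan direction?", "left"), ("v2", "Pan direction?", "right")]
grid = ablation_grid(
    items,
    {"reference": reference.__getitem__, "vlm": vlm_out.__getitem__},
    {"keyword": keyword_proxy, "constant": constant_proxy},
)
```

Read this way, a high score for reference narratives across proxies combined with a low score for VLM narratives would point at narrative content rather than proxy parsing as the source of degradation; the reverse pattern would implicate the proxy.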
Circularity Check
No circularity: SNS evaluation and CaMo training rest on empirical comparison to direct QA
full rationale
The paper defines SNS as an external evaluation procedure that elicits spatial narratives from the target VLM and routes them through a separate frozen proxy LLM for reasoning; performance degradation is then reported as an empirical observation against direct QA baselines. CaMo is introduced as a training method that improves consistency on both metrics. No equations, fitted parameters, or self-citations are shown to reduce the central claims to definitional equivalence or to the inputs by construction; the derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A frozen proxy LLM can reliably evaluate the quality of spatial narratives generated by the target VLM.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose the Spatial Narrative Score (SNS), an evaluation framework that requires VLMs to generate explicit spatial narratives capturing both scene semantics and camera motion, followed by reasoning with a frozen proxy LLM."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "CaMo-3B ... achieves consistent performance across SNS evaluation and direct spatial question answering accuracy."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.