Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Bin Zhao; Delin Qu; Dong Wang; Haoming Song; Qizhi Chen; Xianqiang Gao; Xuelong Li; Zhigang Wang

arxiv: 2605.27318 · v1 · pith:MCUTHIVEnew · submitted 2026-05-26 · 💻 cs.CV

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

Pith reviewed 2026-06-29 18:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords video spatial reasoning · geometric memory · question-guided scoring · Q-Former · memory bank · VSI-Bench · VSTI-Bench · spatial reasoning models

The pith

Question-guided scoring of geometric evidence in two memory banks enables state-of-the-art video spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Q-GeoMem to address the problem of accumulating viewpoint-dependent geometric evidence over the course of a video without filling memory with redundant or irrelevant data. It injects camera-conditioned geometry into visual tokens and keeps two banks: one for recent dense features and camera states, the other for compact long-range semantic-geometric evidence. Each candidate frame for the long-range bank is scored by the product of its Q-Former-based question relevance and its novelty relative to what the bank already holds, and a capacity-based replacement rule keeps the bank compact. Both banks are read before each update and adaptively fused with the current frame. Experiments on VSI-Bench and VSTI-Bench are reported to give the highest scores among the evaluated spatial reasoning models, which the authors take as support for question-specific guidance over generic temporal caches in long-horizon reasoning.

Core claim

Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains a Fine-Grained Context Bank for recent features plus a Semantic-Geometric Evidence Bank for long-range evidence. Each candidate frame receives a score equal to the product of its Q-Former question relevance and its novelty relative to the retained bank; the score is stored and the bank is kept compact by a capacity-based replacement rule. During reasoning the two memories are read before the update and adaptively fused with the current frame. On VSI-Bench and VSTI-Bench this produces state-of-the-art results among evaluated spatial reasoning models, with ablations confirming the scoring mechanism's contribution.

What carries the argument

The question-guided scoring mechanism that multiplies Q-Former relevance and novelty scores to decide which frames enter the Semantic-Geometric Evidence Bank.
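
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `relevance` argument stands in for whatever the Q-Former produces, and defining novelty as one minus the maximum cosine similarity to retained entries is an assumption made here, since the abstract does not specify the novelty measure. All function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def novelty(frame: Vector, bank: Sequence[Vector]) -> float:
    """Assumed novelty: 1 - max cosine similarity to the bank, clipped to [0, 1]."""
    if not bank:
        return 1.0  # nothing retained yet, so any frame is fully novel
    max_sim = max(cosine(frame, entry) for entry in bank)
    return min(1.0, max(0.0, 1.0 - max_sim))


def evidence_score(relevance: float, frame: Vector, bank: Sequence[Vector]) -> float:
    """Product of question relevance (placeholder for the Q-Former output) and novelty."""
    return relevance * novelty(frame, bank)
```

Under these assumptions a frame that duplicates a retained entry scores zero however relevant it is, and an irrelevant frame scores zero however novel it is, which is the behaviour a product (rather than a sum) is meant to enforce.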

If this is right

The two-bank design separates recent dense context from compact long-range evidence, reducing redundancy while preserving question-useful geometry.
Capacity-based replacement keeps memory size bounded without manual tuning of retention thresholds.
Reading both banks before each update allows the current frame to be fused with prior evidence in a question-aware manner.
Ablation results isolate the scoring step as a major contributor to the observed benchmark gains.
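
One way to picture the capacity-based replacement in the list above is a fixed-size store that evicts its weakest entry. The sketch below assumes lowest-score eviction, which the abstract does not confirm; `EvidenceBank` and `offer` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class EvidenceBank:
    """Fixed-capacity store of (score, payload) pairs."""
    capacity: int
    entries: List[Tuple[float, Any]] = field(default_factory=list)

    def offer(self, score: float, payload: Any) -> bool:
        """Try to store a candidate; return True if it was retained."""
        if self.capacity <= 0:
            return False
        if len(self.entries) < self.capacity:
            self.entries.append((score, payload))
            return True
        # Bank is full: find the lowest-scoring entry (assumed eviction rule).
        weakest = min(range(len(self.entries)), key=lambda i: self.entries[i][0])
        if score > self.entries[weakest][0]:
            self.entries[weakest] = (score, payload)
            return True
        return False


bank = EvidenceBank(capacity=2)
bank.offer(0.5, "frame_a")
bank.offer(0.2, "frame_b")
bank.offer(0.1, "frame_c")  # rejected: weaker than everything retained
bank.offer(0.9, "frame_d")  # replaces frame_b, the weakest entry
```

Because the stored score travels with each entry, the same value is available later at read time, which matches the abstract's statement that the score is stored and reused.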

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same relevance-novelty product could be applied to other video tasks that need selective retention, such as action anticipation or object tracking over long sequences.
If the Q-Former scoring generalizes across question types, the framework might extend to non-spatial video-language problems with only minor changes to the geometry injection step.
Testing the method on videos longer than those in VSI-Bench would check whether the capacity rule continues to protect critical evidence as sequence length grows.
Combining the memory banks with stronger camera-pose estimators could produce measurable further gains on benchmarks that stress viewpoint changes.

Load-bearing premise

The product of Q-Former question relevance and novelty scores reliably selects frames that supply useful geometric evidence without discarding information needed for long-horizon reasoning.

What would settle it

Replace the relevance-novelty product with uniform or random frame selection, at the same memory capacity, on VSI-Bench. If performance does not fall below the reported results, the question-guided scoring is not necessary for the gains; a clear drop would instead support the claim.
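
A capacity-matched version of this control could be set up along the following lines. The three selectors are hypothetical helpers written for this note: they pick frame indices under the same budget so that only the selection rule differs. `select_by_score` is an offline top-k simplification, not the streaming replacement the paper describes.

```python
import random
from typing import List, Sequence


def select_random(num_frames: int, capacity: int, seed: int = 0) -> List[int]:
    """Pick `capacity` distinct frame indices uniformly at random."""
    rng = random.Random(seed)
    k = min(capacity, num_frames)
    return sorted(rng.sample(range(num_frames), k))


def select_uniform(num_frames: int, capacity: int) -> List[int]:
    """Pick evenly spaced frame indices."""
    k = min(capacity, num_frames)
    if k == 0:
        return []
    stride = num_frames / k
    return [int(i * stride) for i in range(k)]


def select_by_score(scores: Sequence[float], capacity: int) -> List[int]:
    """Pick the `capacity` highest-scoring frames, returned in temporal order."""
    k = min(capacity, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Feeding each index set to the same downstream model and comparing benchmark accuracy would isolate the contribution of the selection rule from that of memory size.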

Figures

Figures reproduced from arXiv: 2605.27318 by Bin Zhao, Delin Qu, Dong Wang, Haoming Song, Qizhi Chen, Xianqiang Gao, Xuelong Li, Zhigang Wang.

**Figure 1.** Motivation of Q-GeoMem. Egocentric indoor videos reveal spatial layout through partial, camera-dependent views, so long-horizon spatial reasoning depends on retaining the right evidence rather than simply storing more frames. For a question such as “How many chairs are in this room?”, FIFO-style memory may mix useful chair observations with irrelevant or repeated views. Q-GeoMem instead treats memory updat…

**Figure 2.** Overview of the proposed framework. Camera-guided geometry fusion first injects spatial cues into frame…

**Figure 3.** Length-based memory diagnostics. (a) Camera-∆ modulation improves FGCB readout on VSTI-Bench†, especially for short videos and camera movement direction, while its effect is limited on long videos. (b) The proposed SGEB design outperforms FIFO memory on VSI-Bench, with larger gains as the video length increases.

Original abstract

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Q-GeoMem introduces a dual-bank memory with question-relevance times novelty scoring for video spatial reasoning, but the abstract supplies no numbers to back the SOTA claim.

The paper's main move is replacing a generic temporal cache with two banks (one for recent dense features and camera states, the other for compact long-range semantic-geometric evidence) plus a product of Q-Former relevance and novelty scores to decide retention and replacement. The abstract does not attribute that combination to prior work.

It does a clean job stating the practical problem (irrelevant geometry accumulating in long videos) and giving a coherent engineering response: camera-conditioned geometry injection, score reuse at read time, and adaptive fusion before update. The mention of ablations on the scoring rule shows they at least tried to test the mechanism rather than just assert it.
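
The read-then-update ordering mentioned here can be made concrete with a toy step function. Everything below is a sketch under stated assumptions: the recent bank is modelled as a bounded FIFO, stored scores are reused as read weights, the two readouts are averaged with equal weight, and a fixed `gate` stands in for the adaptive fusion, whose actual form the abstract does not give. All names are hypothetical.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Vector = List[float]


def mean(vectors: Sequence[Vector], dim: int) -> Vector:
    """Plain average; zeros if the bank is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def weighted_mean(scored: Sequence[Tuple[float, Vector]], dim: int) -> Vector:
    """Average weighted by each entry's stored score; zeros if total weight is 0."""
    total = sum(score for score, _ in scored)
    if total <= 0:
        return [0.0] * dim
    return [sum(score * v[i] for score, v in scored) / total for i in range(dim)]


def step(frame: Vector, recent: Deque[Vector],
         evidence: List[Tuple[float, Vector]], gate: float = 0.5) -> Vector:
    dim = len(frame)
    # 1) Read both memories BEFORE the recent bank is updated.
    recent_read = mean(list(recent), dim)
    evidence_read = weighted_mean(evidence, dim)
    # 2) Fuse memory readout with the current frame (fixed gate as a placeholder).
    memory = [0.5 * (r + e) for r, e in zip(recent_read, evidence_read)]
    fused = [gate * f + (1.0 - gate) * m for f, m in zip(frame, memory)]
    # 3) Only now does the current frame enter the recent bank.
    recent.append(frame)
    return fused


recent_bank: Deque[Vector] = deque(maxlen=2)
first = step([2.0, 0.0], recent_bank, [])
second = step([0.0, 2.0], recent_bank, [(1.0, [4.0, 4.0])])
```

Reading before writing means the current frame never attends to itself through memory, so the fused representation reflects prior evidence only.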

The soft spot is the complete absence of numbers. The abstract says SOTA on VSI-Bench and VSTI-Bench and that ablations verify the scoring, yet supplies no deltas, baselines, dataset sizes, or variance. Without those, the central claim cannot be checked for effect size or obvious confounds. The stress-test note finds no internal contradiction in the described architecture, which is fair, but that does not substitute for the missing results.

This is for people already working on memory designs inside video-language models. A reader who cares about spatial reasoning over long clips would get concrete design ideas from the banks and scoring rule. It deserves peer review because the problem is real and the proposed structure is specific enough to evaluate, even if the current evidence is thin.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. It injects camera-conditioned geometry into visual tokens and maintains two complementary memory structures: a Fine-Grained Context Bank storing recent dense features and camera states, and a Semantic-Geometric Evidence Bank holding compact long-range evidence. Candidate frames are scored by the product of Q-Former-based question relevance and novelty scores relative to the retained bank; a capacity-based replacement rule is applied, and both banks are read and adaptively fused during reasoning. The central claim is that this design yields state-of-the-art performance on VSI-Bench and VSTI-Bench, with ablations confirming the contribution of the evidence scoring mechanism.

Significance. If the empirical results hold, the work offers a concrete mechanism for making memory management question-dependent rather than generic, which could reduce redundancy in long-horizon video spatial reasoning. The dual-bank architecture together with the relevance-novelty product scoring rule constitutes a specific, testable design choice that directly targets the problem stated in the introduction. The paper supplies an empirical validation plan on two dedicated benchmarks and reports ablations on the scoring component.

major comments (2)

[Abstract / Experiments] Abstract and experimental results section: the manuscript states that Q-GeoMem achieves state-of-the-art performance on VSI-Bench and VSTI-Bench and that ablations verify the scoring mechanism, yet supplies no numerical results, baseline comparisons, dataset statistics, or error bars. Without these data the central empirical claim cannot be assessed.
[Ablations] The weakest assumption (product of Q-Former relevance and novelty reliably selects useful geometric evidence without discarding critical long-horizon information) is presented as validated by ablations, but no quantitative ablation isolating long-horizon cases or measuring information loss is described. This directly bears on whether the reported gains can be attributed to the proposed mechanism.

minor comments (2)

[Introduction / Method] The terms 'Q-Former', 'Fine-Grained Context Bank', and 'Semantic-Geometric Evidence Bank' appear without an initial definition or citation on first use.
[Method] Notation for the relevance-novelty product score and the capacity-based replacement rule should be introduced with explicit equations rather than prose description only.
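
As an illustration of the kind of notation the last comment asks for, one hypothetical formalisation follows. The symbols ($r_t$ for question relevance, $n_t$ for novelty, $s_t$ for the score, $f_t$ for the frame feature, $\mathcal{M}_t$ for the bank, $K$ for its capacity), the cosine-based novelty, and the lowest-score eviction are all assumptions made for this note; the paper's own definitions may differ.

```latex
\begin{align}
  s_t &= r_t \, n_t, \\
  n_t &= 1 - \max_{(s_j, f_j) \in \mathcal{M}_{t-1}} \cos(f_t, f_j),
        \qquad n_t = 1 \text{ if } \mathcal{M}_{t-1} = \emptyset, \\
  \mathcal{M}_t &=
  \begin{cases}
    \mathcal{M}_{t-1} \cup \{(s_t, f_t)\}
      & \text{if } |\mathcal{M}_{t-1}| < K, \\
    \bigl(\mathcal{M}_{t-1} \setminus \{(s_{j^\ast}, f_{j^\ast})\}\bigr) \cup \{(s_t, f_t)\}
      & \text{if } |\mathcal{M}_{t-1}| = K \text{ and } s_t > s_{j^\ast}, \\
    \mathcal{M}_{t-1}
      & \text{otherwise,}
  \end{cases}
  \qquad j^\ast = \arg\min_{j} s_j .
\end{align}
```

Writing the rule this way makes explicit which quantities would need to be reported in an ablation: the choice of novelty measure, the capacity $K$, and the eviction criterion.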

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical presentation without altering the core claims.

Referee: [Abstract / Experiments] Abstract and experimental results section: the manuscript states that Q-GeoMem achieves state-of-the-art performance on VSI-Bench and VSTI-Bench and that ablations verify the scoring mechanism, yet supplies no numerical results, baseline comparisons, dataset statistics, or error bars. Without these data the central empirical claim cannot be assessed.

Authors: The full manuscript's Experiments section contains tables reporting numerical results on VSI-Bench and VSTI-Bench, baseline comparisons, and ablation studies. We acknowledge that the abstract and the high-level experimental summary do not foreground these numbers. In the revision we will update the abstract to include key quantitative results and ensure the experimental results section explicitly presents dataset statistics, baseline tables, and error bars to allow direct assessment of the SOTA claim.

revision: yes

Referee: [Ablations] The weakest assumption (product of Q-Former relevance and novelty reliably selects useful geometric evidence without discarding critical long-horizon information) is presented as validated by ablations, but no quantitative ablation isolating long-horizon cases or measuring information loss is described. This directly bears on whether the reported gains can be attributed to the proposed mechanism.

Authors: The manuscript already includes quantitative ablations that isolate the contribution of the relevance-novelty scoring rule through controlled comparisons. We agree that explicit long-horizon isolation and information-loss metrics would further strengthen attribution. The revised version will add a dedicated long-horizon ablation and report retention metrics, while retaining the existing ablation results.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical architecture consisting of camera-conditioned geometry injection, dual memory banks, and a Q-Former-based scoring rule for frame selection. No equations, derivations, or predictions are presented that reduce the claimed SOTA performance to a quantity defined by the authors' own prior work or by construction. The central claim rests on benchmark experiments and ablations rather than any self-referential mathematical step. This is the most common honest finding for a design-and-evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract introduces two new memory structures and a scoring mechanism whose effectiveness is asserted via benchmark results; no free parameters, axioms, or invented entities are quantified or justified beyond the high-level description.

axioms (1)

Domain assumption: scoring frames by the product of question relevance and novelty selects evidence that improves spatial reasoning performance.
Used to decide storage and replacement in the Semantic-Geometric Evidence Bank.

invented entities (2)

Fine-Grained Context Bank (no independent evidence)
Purpose: stores recent dense features and camera states.
New component introduced for short-term context.

Semantic-Geometric Evidence Bank (no independent evidence)
Purpose: stores compact long-range evidence.
New component introduced for long-horizon evidence.

pith-pipeline@v0.9.1-grok · 5758 in / 1200 out tokens · 37957 ms · 2026-06-29T18:12:54.236388+00:00 · methodology

[1] [1]

HierarQ: Task-aware hierarchical Q-former for enhanced video understanding

Shehreen Azad, Vibhav Vineet, and Yogesh Singh Rawat. HierarQ: Task-aware hierarchical Q-former for enhanced video understanding. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8545–8556, 2025. 9 A PREPRINT

2025

[2] [2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Flexible frame selection for efficient video reasoning

Shyamal Buch, Arsha Nagrani, Anurag Arnab, and Cordelia Schmid. Flexible frame selection for efficient video reasoning. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29071–29082, 2025

2025

[4] [4]

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, and Song Han. LongVILA: Scaling long-context visual language models for long videos.arXiv:2408.10188, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

ReWind: Understanding long videos with instructed learnable memory.arXiv:2411.15556, 2025

Anxhelo Diko, Tinghuai Wang, Wassim Swaileh, Shiyan Sun, and Ioannis Patras. ReWind: Understanding long videos with instructed learnable memory.arXiv:2411.15556, 2025

work page arXiv 2025

[7] [7]

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction. arXiv:2505.20279, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

M-LLM based video frame selection for efficient video understanding

Kai Hu, Feng Gao, Xiaohan Nie, Peng Zhou, Son Tran, Tal Neiman, Lingyun Wang, Mubarak Shah, Raffay Hamid, Bing Yin, and Trishul Chilimbi. M-LLM based video frame selection for efficient video understanding. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13702–13712, 2025

2025

[9] [9]

Online video understanding: OVBench and VideoChat-online

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online video understanding: OVBench and VideoChat-online. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3328–3338, 2025

2025

[10] [10]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv preprint arXiv:2506.03135, 2025

Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Om- niSpatial: Towards comprehensive spatial reasoning benchmark for vision language models.arXiv:2506.03135, 2026

work page arXiv 2026

[12] [12]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, and Xiaodan Liang. Thinking with Geometry: Active Geometry Integration for Spatial Reasoning.arXiv:2602.06037, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[14] Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv:2510.08531, 2025.

[15] Yun Li, Yiming Zhang, Tao Lin, Xiangrui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? arXiv:2503.23765, 2025.

[16] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models. arXiv:2312.07533, 2024.

[17] Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, and Chen Wang. Vision-Language Memory for Spatial Reasoning. arXiv:2511.20644, 2025.

[18] Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. SpaceR: Reinforcing MLLMs in video spatial reasoning. arXiv:2504.01805, 2025.

[19] Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. Advances in Neural Information Processing Systems, 37:119336–119360, 2024.

[20] Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video LLMs with active real-time interaction via disentangled perception, decision, and reaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24045–24055, 2025.

[21] Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive Keyframe Sampling for Long Video Understanding. arXiv:2502.21271, 2025.

[22] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.

[23] Yuxuan Wang, Yiqi Song, Cihang Xie, Yang Liu, and Zilong Zheng. VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges. arXiv:2409.01071, 2025.

[24] Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv:2501.13468, 2025.

[25] Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10632–10643, 2025.

[26] Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, and Hengshuang Zhao. Visual spatial tuning. arXiv:2511.05491, 2025.

[27] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence. arXiv:2505.23764, 2025.

[28] Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, and Xiaojie Jin. Flash-VStream: Efficient real-time understanding for long video streams. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21059–21069, 2025.

[29] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv:2406.16852, 2024.

[30] Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. LLaVA-NeXT: A strong zero-shot video understanding model. LLaVA Blog, https://llava-vl.github.io/blog/2024-04-30-llava-next-video, 2024.

[31] Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, and Zizhuang Wei. SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models. arXiv:2511.23075, 2025.

[32] Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors. arXiv:2505.24625, 2025.

[33] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.