
arxiv: 2605.27318 · v1 · pith:MCUTHIVEnew · submitted 2026-05-26 · 💻 cs.CV

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Pith reviewed 2026-06-29 18:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords video spatial reasoning · geometric memory · question-guided scoring · Q-Former · memory bank · VSI-Bench · VSTI-Bench · spatial reasoning models

The pith

Question-guided scoring of geometric evidence in two memory banks enables state-of-the-art video spatial reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Q-GeoMem to address the problem of accumulating viewpoint-dependent geometric evidence over a video without filling memory with redundant or irrelevant data. It injects camera-conditioned geometry into visual tokens and keeps two banks: one for recent dense features and camera states, the other for compact long-range semantic-geometric evidence. Each candidate frame is scored by the product of its Q-Former question relevance and its novelty relative to the retained long-range bank, and a capacity-based replacement rule keeps that bank compact. Both banks are read before each update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench report the highest scores among the spatial reasoning models evaluated, which the authors take as evidence that question-guided memory improves long-horizon reasoning over generic temporal caches.

Core claim

Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains a Fine-Grained Context Bank for recent features plus a Semantic-Geometric Evidence Bank for long-range evidence. Each candidate frame receives a score equal to the product of its Q-Former question relevance and its novelty relative to the retained bank; the score is stored and reused during reading, and a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before the update and adaptively fused with the current frame representation. On VSI-Bench and VSTI-Bench this produces state-of-the-art results among the evaluated spatial reasoning models, and the authors report ablations supporting the scoring mechanism's contribution.

What carries the argument

The question-guided scoring mechanism that multiplies Q-Former relevance and novelty scores to decide which frames enter the Semantic-Geometric Evidence Bank.
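
The scoring rule can be written compactly. The notation below is chosen here for illustration and is not taken from the paper: f_t is the candidate frame representation at step t, q the question, and B_{t-1} the Semantic-Geometric Evidence Bank as retained before the update. The functional forms of Rel and Nov are not specified in this summary and are left abstract.

```latex
s_t = r_t \cdot n_t, \qquad
r_t = \mathrm{Rel}(q, f_t), \qquad
n_t = \mathrm{Nov}\left(f_t, \mathcal{B}_{t-1}\right)
```

Assuming both factors are non-negative, the product form means a frame scores low when either factor is near zero: a highly relevant but redundant view and a novel but irrelevant view are both penalized.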

If this is right

  • The two-bank design separates recent dense context from compact long-range evidence, reducing redundancy while preserving question-useful geometry.
  • Capacity-based replacement keeps memory size bounded without manual tuning of retention thresholds.
  • Reading both banks before each update allows the current frame to be fused with prior evidence in a question-aware manner.
  • Ablation results isolate the scoring step as a major contributor to the observed benchmark gains.
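
The read-then-update loop implied by these points can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, novelty is taken to be one minus the maximum cosine similarity to retained evidence, replacement evicts the lowest-scoring entry, and a plain average stands in for the adaptive fusion, whose actual form is not given in this summary.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_mean(vectors, weights):
    """Weighted mean of vectors, or None if there is nothing to average."""
    total = sum(weights)
    if not vectors or total == 0:
        return None
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]


class TwoBankMemory:
    """Hypothetical stand-in for the two banks described above."""

    def __init__(self, recent_size, evidence_capacity):
        self.recent = deque(maxlen=recent_size)  # (feature, camera) pairs, FIFO
        self.evidence = []                       # (score, feature) pairs
        self.capacity = evidence_capacity

    def read(self):
        """Snapshot both banks so fusion sees the state before the update."""
        return list(self.recent), list(self.evidence)

    def novelty(self, feature):
        # Assumption: novelty = 1 - max cosine similarity to retained evidence.
        if not self.evidence:
            return 1.0
        best = max(cosine(feature, f) for _, f in self.evidence)
        return min(1.0, max(0.0, 1.0 - best))

    def update(self, feature, camera, relevance):
        self.recent.append((feature, camera))
        if self.capacity <= 0:
            return
        score = relevance * self.novelty(feature)  # relevance x novelty
        if len(self.evidence) < self.capacity:
            self.evidence.append((score, feature))
            return
        # Assumption: replace the lowest-scoring entry only if the candidate beats it.
        worst = min(range(len(self.evidence)), key=lambda i: self.evidence[i][0])
        if score > self.evidence[worst][0]:
            self.evidence[worst] = (score, feature)


def fuse(current, recent, evidence):
    # Stand-in for adaptive fusion: average the current feature with the mean
    # of recent features and the score-weighted mean of stored evidence.
    parts = [current]
    parts.append(weighted_mean([f for f, _ in recent], [1.0] * len(recent)))
    parts.append(weighted_mean([f for _, f in evidence], [s for s, _ in evidence]))
    parts = [p for p in parts if p is not None]
    return [sum(p[i] for p in parts) / len(parts) for i in range(len(current))]


def step(memory, feature, camera, relevance):
    recent, evidence = memory.read()           # read both banks first
    fused = fuse(feature, recent, evidence)    # fuse with the current frame
    memory.update(feature, camera, relevance)  # only then update the banks
    return fused
```

In this sketch the stored score doubles as the read weight, which mirrors the abstract's statement that the score is stored and reused during reading. A repeated view gets novelty 0 and therefore score 0, so it cannot displace an entry once the evidence bank is full.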

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relevance-novelty product could be applied to other video tasks that need selective retention, such as action anticipation or object tracking over long sequences.
  • If the Q-Former scoring generalizes across question types, the framework might extend to non-spatial video-language problems with only minor changes to the geometry injection step.
  • Testing the method on videos longer than those in VSI-Bench would check whether the capacity rule continues to protect critical evidence as sequence length grows.
  • Combining the memory banks with stronger camera-pose estimators could produce measurable further gains on benchmarks that stress viewpoint changes.

Load-bearing premise

The product of Q-Former question relevance and novelty scores reliably selects frames that supply useful geometric evidence without discarding information needed for long-horizon reasoning.

What would settle it

Replace the relevance-novelty product with uniform or random frame selection on VSI-Bench while holding the memory budget fixed. If performance does not drop below the reported level, the question-guided scoring is not necessary for the gains; a clear drop would support it.
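
A comparison of this kind needs the selection policy to be swappable while the number of retained frames stays fixed. The sketch below shows three such policies over a list of per-frame scores. The function names are hypothetical, and the batch form is a simplification of a streaming memory: it only illustrates how the policies differ for a budget of k frames.

```python
import random


def select_by_score(scores, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def select_uniform(n, k):
    """k indices spread evenly over n frames (midpoint of each segment)."""
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))
    seg = n / k
    return [int(seg * i + seg / 2) for i in range(k)]


def select_random(n, k, seed=0):
    """k distinct indices drawn at random, returned in temporal order."""
    rng = random.Random(seed)  # seeded so repeated runs are comparable
    return sorted(rng.sample(range(n), min(k, n)))
```

Running the same downstream model on the frames chosen by each policy, with k held constant, isolates the effect of the scoring rule from the effect of memory size.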

Figures

Figures reproduced from arXiv: 2605.27318 by Bin Zhao, Delin Qu, Dong Wang, Haoming Song, Qizhi Chen, Xianqiang Gao, Xuelong Li, Zhigang Wang.

Figure 1: Motivation of Q-GeoMem. Egocentric indoor videos reveal spatial layout through partial, camera-dependent views, so long-horizon spatial reasoning depends on retaining the right evidence rather than simply storing more frames. For a question such as “How many chairs are in this room?”, FIFO-style memory may mix useful chair observations with irrelevant or repeated views. Q-GeoMem instead treats memory updat…
Figure 2: Overview of the proposed framework. Camera-guided geometry fusion first injects spatial cues into frame
Figure 3: Length-based memory diagnostics. (a) Camera-∆ modulation improves FGCB readout on VSTI-Bench†, especially for short videos and camera movement direction, while its effect is limited on long videos. (b) The proposed SGEB design outperforms FIFO memory on VSI-Bench, with larger gains as the video length increases.
Original abstract

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. It injects camera-conditioned geometry into visual tokens and maintains two complementary memory structures: a Fine-Grained Context Bank storing recent dense features and camera states, and a Semantic-Geometric Evidence Bank holding compact long-range evidence. Candidate frames are scored by the product of Q-Former-based question relevance and novelty scores relative to the retained bank; a capacity-based replacement rule is applied, and both banks are read and adaptively fused during reasoning. The central claim is that this design yields state-of-the-art performance on VSI-Bench and VSTI-Bench, with ablations confirming the contribution of the evidence scoring mechanism.

Significance. If the empirical results hold, the work offers a concrete mechanism for making memory management question-dependent rather than generic, which could reduce redundancy in long-horizon video spatial reasoning. The dual-bank architecture together with the relevance-novelty product scoring rule constitutes a specific, testable design choice that directly targets the problem stated in the introduction. The paper supplies an empirical validation plan on two dedicated benchmarks and reports ablations on the scoring component.

major comments (2)
  1. [Abstract / Experiments] Abstract and experimental results section: the manuscript states that Q-GeoMem achieves state-of-the-art performance on VSI-Bench and VSTI-Bench and that ablations verify the scoring mechanism, yet supplies no numerical results, baseline comparisons, dataset statistics, or error bars. Without these data the central empirical claim cannot be assessed.
  2. [Ablations] The weakest assumption (product of Q-Former relevance and novelty reliably selects useful geometric evidence without discarding critical long-horizon information) is presented as validated by ablations, but no quantitative ablation isolating long-horizon cases or measuring information loss is described. This directly bears on whether the reported gains can be attributed to the proposed mechanism.
minor comments (2)
  1. [Introduction / Method] The terms 'Q-Former', 'Fine-Grained Context Bank', and 'Semantic-Geometric Evidence Bank' appear without an initial definition or citation on first use.
  2. [Method] Notation for the relevance-novelty product score and the capacity-based replacement rule should be introduced with explicit equations rather than prose description only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the empirical presentation without altering the core claims.

  1. Referee: [Abstract / Experiments] Abstract and experimental results section: the manuscript states that Q-GeoMem achieves state-of-the-art performance on VSI-Bench and VSTI-Bench and that ablations verify the scoring mechanism, yet supplies no numerical results, baseline comparisons, dataset statistics, or error bars. Without these data the central empirical claim cannot be assessed.

    Authors: The full manuscript's Experiments section contains tables reporting numerical results on VSI-Bench and VSTI-Bench, baseline comparisons, and ablation studies. We acknowledge that the abstract and the high-level experimental summary do not foreground these numbers. In the revision we will update the abstract to include key quantitative results and ensure the experimental results section explicitly presents dataset statistics, baseline tables, and error bars to allow direct assessment of the SOTA claim. revision: yes

  2. Referee: [Ablations] The weakest assumption (product of Q-Former relevance and novelty reliably selects useful geometric evidence without discarding critical long-horizon information) is presented as validated by ablations, but no quantitative ablation isolating long-horizon cases or measuring information loss is described. This directly bears on whether the reported gains can be attributed to the proposed mechanism.

    Authors: The manuscript already includes quantitative ablations that isolate the contribution of the relevance-novelty scoring rule through controlled comparisons. We agree that explicit long-horizon isolation and information-loss metrics would further strengthen attribution. The revised version will add a dedicated long-horizon ablation and report retention metrics, while retaining the existing ablation results. revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical architecture consisting of camera-conditioned geometry injection, dual memory banks, and a Q-Former-based scoring rule for frame selection. No equations, derivations, or predictions are presented that reduce the claimed SOTA performance to a quantity defined by the authors' own prior work or by construction. The central claim rests on benchmark experiments and ablations rather than any self-referential mathematical step. This is the most common honest finding for a design-and-evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The abstract introduces two new memory structures and a scoring mechanism whose effectiveness is asserted via benchmark results; none of them is quantified or independently justified beyond the high-level description.

axioms (1)
  • domain assumption: Scoring frames by the product of question relevance and novelty selects evidence that improves spatial reasoning performance
    Used to decide storage and replacement in the Semantic-Geometric Evidence Bank
invented entities (2)
  • Fine-Grained Context Bank (no independent evidence)
    purpose: Stores recent dense features and camera states
    New component introduced for short-term context
  • Semantic-Geometric Evidence Bank (no independent evidence)
    purpose: Stores compact long-range evidence
    New component introduced for long-horizon evidence

pith-pipeline@v0.9.1-grok · 5758 in / 1200 out tokens · 37957 ms · 2026-06-29T18:12:54.236388+00:00 · methodology

