pith. machine review for the scientific record.

arxiv: 2604.08337 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:01 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language pre-training · instance-aware learning · contrastive alignment · spatial-temporal grounding · instance retrieval · video understanding · zero-shot evaluation

The pith

Instance-aware pre-training adds grounded instance contrastive alignment to improve both local retrieval and global understanding in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard vision-language pre-training excels at matching whole scenes to text but falls short on reasoning about specific instances like particular objects or events within them. To address this, it presents a framework that trains on both global alignments and instance-specific contrastive losses, where text is tied to exact locations in space and time. This requires a new dataset with annotations at both scales for images and videos. If successful, the approach would allow models to retrieve and localize precise details better while also boosting or at least not harming overall scene comprehension on video tasks.

Core claim

By jointly optimizing global vision-text alignment and instance-level contrastive alignment through grounding textual mentions to specific spatial-temporal regions in a dual-granularity dataset, the method achieves better instance-level retrieval than prior models and a same-data baseline, while delivering competitive zero-shot performance on video benchmarks.

What carries the argument

The instance-aware pre-training objective that performs fine-grained contrastive alignment by linking textual descriptions to particular spatial-temporal instance regions alongside global scene matching.
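To make the shape of this objective concrete, here is a minimal sketch of a joint global-plus-instance contrastive loss in PyTorch. The function names, temperature, and instance-loss weight are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of paired embeddings, each of shape (N, D).
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_alignment_loss(clip_emb, caption_emb, instance_emb, mention_emb, lambda_inst=1.0):
    # Global term: whole clip vs. holistic caption.
    # Instance term: pooled spatial-temporal region vs. its grounded textual mention.
    loss_global = info_nce(clip_emb, caption_emb)
    loss_instance = info_nce(instance_emb, mention_emb)
    return loss_global + lambda_inst * loss_instance
```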

If this is right

  • Substantially better performance on instance-level retrieval compared to existing vision-language pre-training models.
  • Outperformance of a strong baseline trained on the identical data corpus, confirming the advantage of the instance-aware component.
  • Competitive zero-shot results on video understanding benchmarks including MSR-VTT and DiDeMo.
  • More precise localization of text mentions to the correct instances rather than diffuse scene-level attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Instance grounding may enable new applications in interactive video querying where users refer to specific objects or actions over time.
  • Similar instance-aware objectives could be tested in other domains like audio-visual or 3D scene understanding to see if the benefits transfer.
  • The dual-granularity annotation approach might reduce the need for task-specific fine-tuning in some retrieval scenarios.

Load-bearing premise

The dense grounded instance descriptions in the dataset can be produced at scale with sufficient accuracy to provide useful supervision without adding significant noise or errors.

What would settle it

Observing no improvement or a decline in instance-level retrieval accuracy when using the instance-aware objective compared to a global-only model trained on the same dataset would falsify the central benefit claimed.
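One hedged way to score that ablation: compute instance-level Recall@K for the instance-aware model and the global-only same-data model on the same held-out mention-region pairs. The data layout and function names below are assumptions for illustration.

```python
import numpy as np

def recall_at_k(text_emb, region_emb, k=1):
    # text_emb[i] should retrieve region_emb[i]; both arrays are (N, D), L2-normalized.
    sims = text_emb @ region_emb.T                            # (N, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]                   # indices of the k best regions per query
    hits = (topk == np.arange(len(text_emb))[:, None]).any(axis=1)
    return hits.mean()

# The central benefit claim would be falsified if, on the same held-out set,
# recall_at_k(instap_text, instap_regions) <= recall_at_k(globalonly_text, globalonly_regions).
```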

Figures

Figures reproduced from arXiv: 2604.08337 by Ashutosh Kumar, Betty Le Dem, Jingjing Pan, Mingfang Zhang, Mustafa Erdogan, Norimasa Kobori, Quan Kong, Rajat Saini.

Figure 1. Conceptual overview of the InstAP framework and InstVL dataset. Left: InstVL features dual-granularity video annotations: holistic Global Captions and entity-grounded Trajectory Instance Captions. Right: InstAP fuses global and instance-level features via Global-Local Cross Attention, optimizing through joint Global and Instance-Aware Alignment objectives. view at source ↗

Figure 2. Illustration of our InstVL dataset. We display sampled frames with color-coded, temporally-consistent instance trajectories (e.g., ID: 0, ID: 1). The top text provides the fine-grained instance captions grounded to these trajectories; the bottom text provides the holistic global caption for the entire scene. view at source ↗

Figure 3. Our instance-aware alignment mechanism (caption truncated at source). view at source ↗

Figure 5. InstAP consistently retrieves correct fine-grained de… (caption truncated at source). view at source ↗
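Figure 1 describes fusing global and instance-level features via Global-Local Cross Attention. One plausible reading is sketched below, under the assumption that instance features act as queries over the global scene tokens; the dimensions, head count, and residual layout are not specified in the available text.

```python
import torch
import torch.nn as nn

class GlobalLocalCrossAttention(nn.Module):
    # Instance (local) tokens attend over global scene tokens; a residual path keeps the local features.
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, instance_tokens, global_tokens):
        # instance_tokens: (B, num_instances, dim) -- queries
        # global_tokens:   (B, num_patches, dim)   -- keys/values
        q = self.norm_q(instance_tokens)
        kv = self.norm_kv(global_tokens)
        fused, _ = self.attn(q, kv, kv)
        return instance_tokens + fused
```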
read the original abstract

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly optimizes global vision-text alignment and fine-grained, instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. To support this, we present InstVL, a large-scale dataset (2 million images, 50,000 videos) with dual-granularity annotations: holistic scene captions and dense, grounded instance descriptions. On the InstVL benchmark, InstAP substantially outperforms existing VLP models on instance-level retrieval, and also surpasses a strong VLP baseline trained on the exact same data corpus, isolating the benefit of our instance-aware objective. Moreover, instance-centric pre-training improves global understanding: InstAP achieves competitive zero-shot performance on multiple video benchmarks, including MSR-VTT and DiDeMo. Qualitative visualizations further show that InstAP localizes textual mentions to the correct instances, while global-only models exhibit more diffuse, scene-level attention.
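The dual-granularity annotations described in the abstract and Figure 2 suggest a per-video record roughly shaped like the following. Every field name here is a hypothetical reconstruction for illustration, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class InstanceTrajectory:
    instance_id: int                                  # e.g. ID: 0, ID: 1 in Figure 2
    caption: str                                      # grounded instance description
    boxes: list = field(default_factory=list)         # per-frame (t, x1, y1, x2, y2) tuples

@dataclass
class VideoAnnotation:
    video_id: str
    global_caption: str                               # holistic scene caption
    instances: list = field(default_factory=list)     # list of InstanceTrajectory
```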

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces InstAP, an instance-aware vision-language pre-training framework that jointly optimizes global vision-text alignment and fine-grained instance-level contrastive alignment by grounding textual mentions to specific spatial-temporal regions. It is supported by the new InstVL dataset (2M images, 50k videos) with dual-granularity annotations consisting of holistic scene captions and dense grounded instance descriptions. The central claims are that InstAP substantially outperforms existing VLP models and a same-data baseline on instance-level retrieval while also achieving competitive zero-shot performance on global video benchmarks such as MSR-VTT and DiDeMo.

Significance. If the results hold with clean supervision, this work addresses an important gap in VLP by improving instance-level reasoning alongside global understanding, which is relevant for spatial-temporal tasks. The use of a same-data baseline to isolate the benefit of the instance-aware objective is a methodological strength that supports causal attribution to the proposed loss. The approach could influence future pre-training designs if the grounding annotations prove reliable at scale.

major comments (2)
  1. [InstVL Dataset] InstVL Dataset section: The manuscript provides no validation metrics (e.g., inter-annotator agreement, spot-check error rates, or comparison to human ground truth) for the accuracy of the dense grounded instance descriptions. This is load-bearing for the central empirical claim because the instance-level contrastive objective depends on correct spatial-temporal grounding; systematic misalignment or description errors would introduce noisy supervision that could artifactually inflate instance-retrieval gains or degrade the global alignment term.
  2. [Results] Results section: The abstract reports outperformance on instance retrieval and competitive global results, but the provided text contains no quantitative tables, error bars, training hyperparameters, or statistical significance tests. Without these, the claim that InstAP surpasses the same-data baseline cannot be verified, undermining assessment of whether the instance-aware objective provides a genuine benefit.
minor comments (1)
  1. [Abstract] Abstract: The claims are stated at a high level; adding one sentence on how InstVL annotations were generated (human, automated, or hybrid) would improve clarity without lengthening the abstract excessively.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the InstVL dataset and results presentation. These comments highlight important aspects for strengthening the manuscript's rigor. We address each point below and will make the corresponding revisions.

read point-by-point responses
  1. Referee: [InstVL Dataset] InstVL Dataset section: The manuscript provides no validation metrics (e.g., inter-annotator agreement, spot-check error rates, or comparison to human ground truth) for the accuracy of the dense grounded instance descriptions. This is load-bearing for the central empirical claim because the instance-level contrastive objective depends on correct spatial-temporal grounding; systematic misalignment or description errors would introduce noisy supervision that could artifactually inflate instance-retrieval gains or degrade the global alignment term.

    Authors: We agree that explicit validation metrics are necessary to support the reliability of the grounded annotations used in the instance-level contrastive loss. In the revised manuscript, we will add a new subsection to the InstVL Dataset description reporting inter-annotator agreement scores (computed on a pilot set of 1,000 instances) and spot-check error rates from manual verification of a random sample of 500 descriptions against human ground truth. These additions will quantify annotation quality and address concerns about potential noise in the supervision. revision: yes

  2. Referee: [Results] Results section: The abstract reports outperformance on instance retrieval and competitive global results, but the provided text contains no quantitative tables, error bars, training hyperparameters, or statistical significance tests. Without these, the claim that InstAP surpasses the same-data baseline cannot be verified, undermining assessment of whether the instance-aware objective provides a genuine benefit.

    Authors: We acknowledge the need for complete quantitative reporting to allow verification of the claims. In the revised manuscript, we will expand the Results section to include the main comparison tables with all metrics, standard error bars (from multiple random seeds), a full list of training hyperparameters, and statistical significance tests (e.g., paired t-tests) against the same-data baseline. This will make the performance gains and the benefit of the instance-aware objective directly verifiable. revision: yes
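For reference, the grounding spot check promised in response 1 could be scored roughly as below: the fraction of auto-generated instance boxes that fail to match a human-verified box at a chosen IoU threshold. The threshold and pairing scheme are assumptions.

```python
def iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes in the same frame.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spot_check_error_rate(auto_boxes, human_boxes, thresh=0.5):
    # Paired lists: auto-generated box vs. human-verified box for the same sampled mention.
    misses = sum(iou(a, h) < thresh for a, h in zip(auto_boxes, human_boxes))
    return misses / max(len(auto_boxes), 1)
```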
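Likewise, the significance test promised in response 2 could be a paired t-test over per-seed retrieval scores, as in this sketch. The numbers below are placeholders, not results from the paper.

```python
from scipy import stats

# Placeholder per-seed Recall@1 values; the real numbers would come from the revised tables.
instap_r1   = [41.2, 40.8, 41.5]
baseline_r1 = [38.9, 39.1, 38.7]

t_stat, p_value = stats.ttest_rel(instap_r1, baseline_r1)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")
```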

Circularity Check

0 steps flagged

No circularity: purely empirical claims with independent dataset and baselines

full rationale

The paper introduces a new pre-training framework (InstAP) and supporting dataset (InstVL) with dual-granularity annotations, then reports empirical performance gains on instance-level retrieval and zero-shot video tasks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The central claims rest on direct comparisons to prior VLP models and a same-data baseline, which are externally verifiable and do not reduce to self-definition or input fitting by construction. This is the expected non-finding for an empirical CV paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit hyperparameters, axioms, or new entities are detailed beyond standard contrastive learning assumptions and the quality of the new annotations.

pith-pipeline@v0.9.0 · 5507 in / 1212 out tokens · 71760 ms · 2026-05-10T18:01:37.164187+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 18 canonical work pages · 8 internal anchors
