pith. machine review for the scientific record.

arxiv: 2512.13511 · v3 · submitted 2025-12-15 · 💻 cs.CV · cs.IR

Recognition: no theorem link

Adapting MLLMs for Nuanced Video Retrieval

Authors on Pith no claims yet

Pith reviewed 2026-05-16 22:16 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords nuanced video retrieval · multimodal large language models · contrastive learning · hard negatives · temporal actions · negation · composed retrieval · modality gap

The pith

Repurposing an MLLM with text-only contrastive training on hard negatives yields embeddings that achieve state-of-the-art performance on nuanced video retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build a single embedding model that handles three specific kinds of subtlety in video search: telling apart actions that run in opposite directions over time, respecting explicit negations in the query, and retrieving videos that match a starting clip plus a text instruction for change. The method takes a multimodal large language model already trained to generate text and converts it into an embedding model by applying contrastive training exclusively on text examples. Hard negatives are sampled from text to force the model to learn the required distinctions. The resulting model reaches state-of-the-art accuracy on every benchmark for these nuanced tasks even though no video data was used during fine-tuning. The authors further show that the text-only process shrinks the separation between text and video vectors in the shared space.
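
To make the training recipe concrete, here is a minimal sketch (PyTorch) of contrastive fine-tuning on text triplets with an explicit hard negative per anchor. The function and tensor names, and the temperature value, are illustrative placeholders standing in for whatever embedding head and hyperparameters the paper actually uses; this is not the authors' released code.

```python
# Minimal sketch: InfoNCE-style loss over text triplets (anchor, positive, hard negative).
# Embeddings are assumed to come from the MLLM-as-embedder (e.g., an EOL-style prompt).
import torch
import torch.nn.functional as F

def info_nce_with_hard_negatives(anchor_emb, positive_emb, hard_neg_emb, temperature=0.05):
    """anchor_emb, positive_emb, hard_neg_emb: (B, D) tensors for a batch of text triplets."""
    anchor = F.normalize(anchor_emb, dim=-1)
    pos = F.normalize(positive_emb, dim=-1)
    neg = F.normalize(hard_neg_emb, dim=-1)

    # Similarity of each anchor to every positive in the batch (in-batch negatives) ...
    logits_pos = anchor @ pos.T / temperature                          # (B, B)
    # ... plus its own hard negative, e.g. the temporally reversed or negated caption.
    logits_hard = (anchor * neg).sum(-1, keepdim=True) / temperature   # (B, 1)

    logits = torch.cat([logits_pos, logits_hard], dim=1)               # (B, B+1)
    labels = torch.arange(anchor.size(0), device=anchor.device)        # diagonal = true pair
    return F.cross_entropy(logits, labels)
```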

Core claim

We repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone with carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state of the art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings leading to better organization of the embedding space.

What carries the argument

Contrastive fine-tuning of an MLLM on hard-negative text pairs that force learning of temporal, negation, and composed distinctions.

If this is right

  • The model reliably separates temporally opposite actions such as opening a door versus closing a door.
  • Queries containing explicit negators like 'not' or 'none' are handled correctly without retrieving unwanted content.
  • Composed retrieval works when the query combines an example video with a text edit instruction.
  • Text and video embeddings sit closer together in the space, improving overall organization for retrieval (a simple way to quantify this gap is sketched after this list).
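
One common way to quantify that last point, following the modality-gap literature the paper cites (Liang et al.), is the distance between the centroids of normalized text and video embeddings. The sketch below is illustrative only and assumes paired test-set embeddings are already available; it is not necessarily the measurement the paper reports.

```python
# Illustrative modality-gap measure: distance between modality centroids of
# L2-normalised embeddings. `text_embs` / `video_embs` are assumed (N, D) tensors.
import torch
import torch.nn.functional as F

def modality_gap(text_embs: torch.Tensor, video_embs: torch.Tensor) -> float:
    t = F.normalize(text_embs, dim=-1).mean(dim=0)   # text centroid
    v = F.normalize(video_embs, dim=-1).mean(dim=0)  # video centroid
    return (t - v).norm().item()

# A smaller value after text-only fine-tuning would support the claim that the gap shrinks,
# e.g. modality_gap(text_after, video_after) < modality_gap(text_before, video_before).
```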

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Large language models trained only on text appear to already contain much of the logical and temporal structure needed for video distinctions.
  • The same adaptation pattern could be tested on other modalities or tasks where paired data is scarce but text descriptions are abundant.
  • Systematic expansion of the hard-negative sampling strategy to cover more complex logical combinations might further strengthen performance.

Load-bearing premise

Hard negatives sampled from text data alone are sufficient to instill temporal, negation, and multimodal distinctions that transfer effectively to video embeddings.

What would settle it

If the text-only model scores below existing video-trained baselines on any of the temporal, negation, or composed retrieval benchmarks, the claim that text hard negatives suffice would not hold.

Figures

Figures reproduced from arXiv: 2512.13511 by Andrew Zisserman, Piyush Bagad.

Figure 1
Figure 1: (a) MLLMs (M) can be prompted to output a video embedding using an Explicit One-word Limitation (EOL) prompt [36]. (b) Given that M projects video/text into a common space, we adapt it contrastively solely on text triplets. By including time-aware triplets (shown with a clock), we achieve strong zero-shot retrieval, particularly on time-sensitive queries. Below we show retrieval results for two queries where time or… view at source ↗
Figure 2
Figure 2: Pipeline to extract time-aware hard negatives. Given a caption from Ego4D, we extract the verb-object pair to verify if it is chiral. If so, we prompt an LLM to generate a time-aware hard negative and replace the anonymized subject with a realistic one. view at source ↗
Figure 3
Figure 3: Qualitative results. We show qualitative retrieval results for various queries with the base MLLM (Tarsier-7B) before (left) and after (right) TARA fine-tuning. Since it is hard to see key details, we highlight the part of the video that depicts the desired action. TARA improves understanding of chiral actions, where one needs to distinguish between similar-looking temporally opposite action videos. Kindly z… view at source ↗
Figure 4
Figure 4: Ablation on data size. The data composition of NLI:Ego4D is fixed to 0.9:0.1 and the total number of samples is varied. Beyond n=10,000, the increment in accuracy is not substantial compared to the increase in GPU hours. The left scale corresponds to blue bars showing accuracy; the right scale corresponds to orange bars showing GPU hours. view at source ↗
Figure 6
Figure 6: Comparing Ego4D fraction across datasets. We compare the avg. chiral accuracy with α=0.1, 0.2 across all three datasets. While α=0.2 outperforms α=0.1 on SSv2, the latter generalizes better to EPIC and Charades. view at source ↗
Figure 5
Figure 5: Ablation on data composition. This figure plots the chiral (temporal) accuracy (y-axis) vs. non-chiral (static) accuracy (x-axis) for different data compositions. Let α ∈ [0, 1] be the fraction of Ego4D data used during TARA fine-tuning. α is shown beside each scatter point. We find that using 0.1 ≤ α ≤ 0.6 achieves the best trade-off. Models corresponding to α ∈ {0.1, 0.2, . . . , 0.6} vary slight… view at source ↗
Figure 8
Figure 8: More qualitative results from the CiA-Retrieval splits. Left shows top-2 videos retrieved from the base Tarsier model and right shows those from Tarsier adapted with TARA. view at source ↗
Figure 7
Figure 7: Modality gap. We analyse video and text embeddings from the test set of MSRVTT (n=1000 video-caption pairs), before and after TARA. view at source ↗
Figure 9
Figure 9: Some failure cases of TARA. Left shows top-2 videos retrieved from the base Tarsier model and right shows those from Tarsier adapted with TARA. There are some cases where TARA fine-tuning does not improve the base model's abilities for time-sensitive retrieval. view at source ↗
read the original abstract

Our objective is to build an embedding model that captures the nuanced relationship between a search query and candidate videos. We cover three aspects of nuanced retrieval: (i) temporal, (ii) negation, and (iii) multimodal. For temporal nuance, we consider chiral actions that need distinguishing between temporally opposite actions like "opening a door" vs. "closing a door". For negation, we consider queries with negators such as "not", "none" that allow user to specify what they do not want. For multimodal nuance, we consider the task of composed retrieval where the query comprises a video along with a text edit instruction. The goal is to develop a unified embedding model that handles such nuances effectively. To that end, we repurpose a Multimodal Large Language Model (MLLM) trained to generate text into an embedding model. We fine-tune it with a contrastive loss on text alone with carefully sampled hard negatives that instill the desired nuances in the learned embedding space. Despite the text-only training, our method achieves state of the art performance on all benchmarks for nuanced video retrieval. We also analyze how this improvement is achieved, and show that text-only training reduces the modality gap between text and video embeddings leading to better organization of the embedding space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes repurposing a Multimodal Large Language Model (MLLM) as an embedding model for nuanced video retrieval by fine-tuning it with contrastive loss exclusively on text data. Hard negatives are sampled to target three nuances: temporal distinctions (e.g., chiral actions such as 'opening a door' vs. 'closing a door'), negation (queries containing 'not' or 'none'), and multimodal composed retrieval (video plus text edit instruction). The central claim is that this text-only training yields state-of-the-art performance on all relevant benchmarks while reducing the modality gap between text and video embeddings.

Significance. If the results and analysis hold, the work would be significant for showing that targeted text-only contrastive fine-tuning can instill transferable temporal, negation, and compositional distinctions in MLLM embeddings without paired video data, providing an efficient path to adapt large models for complex cross-modal retrieval and potentially reducing reliance on expensive multimodal training corpora.

major comments (2)
  1. [Abstract] Abstract: The assertion of 'state of the art performance on all benchmarks' and 'modality-gap reduction' supplies no quantitative metrics, benchmark names, hard-negative sampling procedure, or error analysis, leaving the central empirical claim unevidenced in the provided text.
  2. [Experiments] Experiments/Analysis section: No ablation isolates whether text-only hard negatives (e.g., chiral or negated pairs) actually separate video embeddings at inference time. A direct comparison of pre- vs. post-training video-video similarities for opposite actions, or video-only vs. text-only negative variants, is required to substantiate that distinctions survive the modality gap rather than arising from text clustering alone.
minor comments (2)
  1. [Method] Clarify the precise MLLM backbone, the exact contrastive loss formulation (including temperature and margin hyperparameters), and the criteria used to sample hard negatives from text.
  2. [Method] Add explicit notation for the embedding extraction process from the MLLM (e.g., which token or layer is used) to improve reproducibility.
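
For context on minor comments 1 and 2 above: a common recipe in the E5-V line of work [36] is to issue an explicit one-word ("EOL") prompt and read off the last-layer hidden state at the final token position. The sketch below illustrates that pattern for text input only, with a placeholder checkpoint name and prompt wording; it is an assumption about a typical setup, not necessarily the paper's exact configuration.

```python
# Hedged sketch of EOL-style embedding extraction with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-mllm-checkpoint"  # placeholder, not the paper's backbone
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def eol_text_embedding(caption: str) -> torch.Tensor:
    prompt = f'Caption: "{caption}"\nSummarize the caption above in one word:'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Last hidden layer, final token position -> (hidden_dim,) embedding.
    return out.hidden_states[-1][0, -1]
```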

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract would be strengthened by including quantitative metrics and that targeted ablations would better isolate the effect of text-only hard negatives on video embeddings. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of 'state of the art performance on all benchmarks' and 'modality-gap reduction' supplies no quantitative metrics, benchmark names, hard-negative sampling procedure, or error analysis, leaving the central empirical claim unevidenced in the provided text.

    Authors: We agree the abstract is high-level and lacks specific numbers. The full manuscript reports SOTA results with concrete recall@K metrics on the relevant temporal, negation, and multimodal benchmarks, along with modality-gap analysis via cosine similarities and t-SNE visualizations. In the revision we will expand the abstract to include key quantitative improvements (e.g., recall gains), name the benchmarks, briefly describe the hard-negative sampling strategy, and reference the error analysis already present in the experiments section. revision: yes

  2. Referee: [Experiments] Experiments/Analysis section: No ablation isolates whether text-only hard negatives (e.g., chiral or negated pairs) actually separate video embeddings at inference time. A direct comparison of pre- vs. post-training video-video similarities for opposite actions, or video-only vs. text-only negative variants, is required to substantiate that distinctions survive the modality gap rather than arising from text clustering alone.

    Authors: This observation is correct; the current manuscript shows overall retrieval gains and modality-gap reduction but does not include an explicit pre-/post-training video-video similarity ablation for chiral or negated pairs, nor a video-only versus text-only negative variant comparison. We will add this ablation in the revised Experiments section, reporting average cosine similarities between video embeddings of opposite actions before and after training, as well as results when negatives are drawn from video versus text sources, to demonstrate that the distinctions transfer across the modality gap. revision: yes
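
An illustrative version of the promised ablation: average video-video cosine similarity over chiral (temporally opposite) pairs, computed once with the base embedder and once with the fine-tuned one. The embedding functions here are hypothetical wrappers, not the authors' code.

```python
# Sketch of a pre-/post-training video-video similarity check for chiral pairs.
import torch
import torch.nn.functional as F

def mean_chiral_similarity(embed_fn, chiral_pairs):
    """chiral_pairs: list of (video_a, video_b) where b shows the opposite action of a."""
    sims = []
    for video_a, video_b in chiral_pairs:
        a = F.normalize(embed_fn(video_a), dim=-1)
        b = F.normalize(embed_fn(video_b), dim=-1)
        sims.append((a * b).sum().item())
    return sum(sims) / len(sims)

# If the fine-tuned model truly separates opposite actions, we expect
# mean_chiral_similarity(embed_after, pairs) < mean_chiral_similarity(embed_before, pairs).
```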

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning evaluated on external benchmarks

full rationale

The paper presents an empirical adaptation of an MLLM into an embedding model via contrastive fine-tuning on text-only data with hard negatives, claiming improved video retrieval performance on external benchmarks. No derivation chain, equations, or load-bearing steps reduce to self-defined quantities, fitted inputs renamed as predictions, or self-citation chains. The modality-gap reduction is reported as an observed outcome of training rather than a constructed identity, and the method is validated against independent test sets without internal circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard contrastive-learning assumptions that hard negatives will encode the target distinctions and that text-only training will close the modality gap; no new entities are postulated.

free parameters (1)
  • contrastive loss hyperparameters
    Temperature, margin, and batch construction parameters are chosen or tuned and directly affect the embedding space organization.
axioms (1)
  • domain assumption: Hard negatives sampled from text can instill temporal, negation, and multimodal distinctions that generalize to video.
    Invoked in the description of the fine-tuning procedure and the claim that text-only training suffices.

pith-pipeline@v0.9.0 · 5517 in / 1202 out tokens · 38657 ms · 2026-05-16T22:16:46.712060+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 13 internal anchors

  1. [1]

    Vision-language models do not understand negation

    Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. In CVPR, 2025. 6, 7, 8, 4

  2. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, 2017. 2

  3. [3]

    Localizing Moments in Video with Natural Language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing Moments in Video with Natural Language. In ICCV, 2017. 1, 8

  4. [4]

    Claude 4 system card: Claude opus 4 and claude sonnet 4, 2025

    Anthropic. Claude 4 system card: Claude opus 4 and claude sonnet 4, 2025. Accessed: 2025-11-13. 4

  5. [5]

    Chirality in action: Time-aware video representation learning by latent straightening. arXiv preprint arXiv:2509.08502, 2025

    Piyush Bagad and Andrew Zisserman. Chirality in action: Time-aware video representation learning by latent straightening. arXiv preprint arXiv:2509.08502, 2025. 2, 3, 4, 5, 7, 8

  6. [6]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 2, 6

  7. [7]

    Frozen in Time: A Joint Video and Image Encoder for End-to-end Retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in Time: A Joint Video and Image Encoder for End-to-end Retrieval. In ICCV, 2021. 1, 2

  8. [8]

    Speednet: Learning the Speediness in Videos

    Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the Speediness in Videos. In CVPR, 2020. 2

  9. [9]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181, 2025. 6

  10. [10]

    Revisiting the "Video" in Video-Language Understanding

    Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the "Video" in Video-Language Understanding. In CVPR, 2022. 1, 2, 3

  11. [11]

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In CVPR, 2017. 1, 2, 8

  12. [12]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011. 2

  13. [13]

    Collecting highly parallel data for paraphrase evaluation

    David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011. 1, 2, 8

  14. [14]

    Are we on the right way for evaluating large vision-language models?NeurIPS, 37, 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?NeurIPS, 37, 2024. 2

  15. [15]

    Unfolding Videos Dynamics Via Taylor Expansion. arXiv preprint arXiv:2409.02371, 2024

    Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, and Zhongming Liu. Unfolding Videos Dynamics Via Taylor Expansion. arXiv preprint arXiv:2409.02371, 2024. 2

  16. [16]

    Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024. 1, 2

  17. [17]

    Tvbench: Redesigning video-language evaluation.Arxiv, 2024

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Tvbench: Redesigning video-language evaluation.Arxiv, 2024. 2

  18. [18]

    Think then embed: Generative context improves multimodal embedding.arXiv preprint arXiv:2510.05014,

    Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Qi Guo, Ser-Nam Lim, et al. Think then embed: Generative context improves multimodal embedding.arXiv preprint arXiv:2510.05014,

  19. [19]

    Scaling Egocentric Vision: The EPIC-Kitchens Dataset

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling Egocentric Vision: The EPIC-Kitchens Dataset. In ECCV, 2018. 2, 7, 8, 5

  20. [20]

    TCLR: Temporal Contrastive Learning for Video Representation. Computer Vision and Image Understanding, 2022

    Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, and Mubarak Shah. TCLR: Temporal Contrastive Learning for Video Representation. Computer Vision and Image Understanding, 2022. 2

  21. [21]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In CVPR, 2025. 2

  22. [22]

    How do you do it? fine-grained action understanding with pseudo-adverbs

    Hazel Doughty and Cees GM Snoek. How do you do it? fine-grained action understanding with pseudo-adverbs. In CVPR, 2022. 2, 7, 8, 5

  23. [23]

    Action modifiers: Learning from adverbs in instructional videos

    Hazel Doughty, Ivan Laptev, Walterio Mayol-Cuevas, and Dima Damen. Action modifiers: Learning from adverbs in instructional videos. InCVPR, 2020. 8

  24. [24]

    Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval

    Yang Du, Yuqi Liu, and Qin Jin. Reversed in time: A novel temporal-emphasized benchmark for cross-modal video-text retrieval. InACM MM, 2024. 2, 5, 6, 8

  25. [25]

    Temporal Cycle-Consistency Learning

    Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal Cycle-Consistency Learning. In CVPR, 2019. 2

  26. [26]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024. 8

  27. [27]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings.arXiv preprint arXiv:2104.08821, 2021. 3, 4

  28. [28]

    Video Time: Properties, Encoders and Evaluation

    Amir Ghodrati, Efstratios Gavves, and Cees GM Snoek. Video Time: Properties, Encoders and Evaluation. arXiv preprint arXiv:1807.06980, 2018. 2

  29. [29]

    The "Something Something" Video Database for Learning and Evaluating Visual Common Sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "Something Something" Video Database for Learning and Evaluating Visual Common Sense. In ICCV, 2017. 2, 8, 5

  30. [30]

    Ego4d: Around the World in 3,000 Hours of Egocentric Video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the World in 3,000 Hours of Egocentric Video. In CVPR, 2022. 2, 4

  31. [31]

    Towards universal video retrieval: Generalizing video embedding via synthesized multimodal pyramid curriculum. arXiv preprint arXiv:2510.27571, 2025

    Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Xiaowen Chu. Towards universal video retrieval: Generalizing video embedding via synthesized multimodal pyramid curriculum. arXiv preprint arXiv:2510.27571, 2025. 6

  32. [32]

    What makes a video a video: Analyzing temporal information in video understanding models and datasets

    De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and Juan Carlos Niebles. What makes a video a video: Analyzing temporal information in video understanding models and datasets. In CVPR, 2018. 2

  33. [33]

    Space-Time Correspondence as a Contrastive Random Walk.NeurIPS,

    Allan Jabri, Andrew Owens, and Alexei Efros. Space-Time Correspondence as a Contrastive Random Walk.NeurIPS,

  34. [34]

    Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video

    Dinesh Jayaraman and Kristen Grauman. Slow and Steady Feature Analysis: Higher Order Temporal Coherence in Video. InCVPR, 2016. 2

  35. [35]

    Scaling sentence embeddings with large language models

    Ting Jiang, Shaohan Huang, Zhongzhi Luan, Deqing Wang, and Fuzhen Zhuang. Scaling sentence embeddings with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024. 3

  36. [36]

    E5-V: Universal Embeddings with Multimodal Large Language Models

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models.arXiv preprint arXiv:2407.12580, 2024. 1, 2, 3, 6, 5

  37. [37]

    Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024. 2, 8

  38. [38]

    Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment

    Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, et al. Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In CVPR, 2025. 6, 8

  39. [39]

    Victr: Video-conditioned text representations for activity recognition

    Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, and Michael S Ryoo. Victr: Video-conditioned text representations for activity recognition. In CVPR, 2024. 2

  40. [40]

    Text encoders bottleneck compositionality in contrastive vision-language models. arXiv preprint arXiv:2305.14897, 2023

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. Text encoders bottleneck compositionality in contrastive vision-language models. arXiv preprint arXiv:2305.14897, 2023. 2

  41. [41]

    Self-supervised Video Representation Learning with Space-Time Cubic Puzzles

    Dahun Kim, Donghyeon Cho, and In So Kweon. Self-supervised Video Representation Learning with Space-Time Cubic Puzzles. In AAAI, 2019. 2

  42. [42]

    HMDB: A Large Video Database for Human Motion Recognition

    Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A Large Video Database for Human Motion Recognition. In ICCV, 2011. 2, 8

  43. [43]

    The language of actions: Recovering the syntax and semantics of goal-directed human activities

    Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014. 8

  44. [44]

    Revealing Single Frame Bias for Video-and-Language Learning

    Jie Lei, Tamara L Berg, and Mohit Bansal. Revealing Single Frame Bias for Video-and-Language Learning. arXiv:2206.03428, 2022. 1, 3, 6

  45. [45]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. 2

  46. [46]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. InICCV, 2023. 6

  47. [47]

    Mvbench: A Comprehensive Multi-modal Video Understanding Benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A Comprehensive Multi-modal Video Understanding Benchmark. In CVPR, 2024. 2

  48. [48]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.NeurIPS, 35, 2022

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning.NeurIPS, 35, 2022. 3, 5

  49. [49]

    Egocentric video-language pretraining. NeurIPS, 35, 2022

    Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. NeurIPS, 35, 2022. 2

  50. [50]

    Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024. 2

  51. [51]

    Lamra: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In CVPR, 2025. 3, 6, 8

  52. [52]

    Clip4clip: An empirical study of clip for end to end video clip retrieval.arXiv preprint arXiv:2104.08860, 2021

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval.arXiv preprint arXiv:2104.08860, 2021. 6

  53. [53]

    X-clip: End-to-end multi-grained contrastive learning for video-text retrieval

    Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In ACM MM, 2022. 2, 6

  54. [54]

    Foundation models for video understanding: A survey. arXiv preprint arXiv:2405.03770, 2024

    Neelu Madan, Andreas Møgelmose, Rajat Modi, Yogesh S Rawat, and Thomas B Moeslund. Foundation models for video understanding: A survey. arXiv preprint arXiv:2405.03770, 2024. 2

  55. [55]

    Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590, 2025. 2, 3, 6, 8

  56. [56]

    Howto100m: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. InICCV,

  57. [57]

    Verbs in Action: Improving Verb Understanding in Video-Language Models

    Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. Verbs in Action: Improving Verb Understanding in Video-Language Models. In ICCV, 2023. 2, 7, 8, 4

  58. [58]

    Perception test: A diagnostic benchmark for multimodal video models. NeurIPS, 36, 2023

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. NeurIPS, 36, 2023. 2

  59. [59]

    Spatiotemporal Contrastive Video Representation Learning

    Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal Contrastive Video Representation Learning. In CVPR,

  60. [60]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 2021. 2, 6, 8

  61. [61]

    Direct preference optimization: Your language model is secretly a reward model.NeurIPS, 36, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 36, 2023. 3

  62. [62]

    Broaden Your Views for Self-supervised Video Learning

    Adria Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Pătrăucean, Florent Altché, Michal Valko, et al. Broaden Your Views for Self-supervised Video Learning. In ICCV, 2021. 2

  63. [63]

    Velociti: Benchmarking video-language compositional reasoning with strict entailment

    Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, and Makarand Tapaswi. Velociti: Benchmarking video-language compositional reasoning with strict entailment. In CVPR, 2025. 2

  64. [64]

    Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

    Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In ECCV, 2016. 2, 5

  65. [65]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    K Soomro. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild.arXiv:1212.0402, 2012. 2, 8

  66. [66]

    Is ImageNet Worth 1 Video? Learning Strong Image Encoders from 1 Long Unlabelled Video.arXiv preprint arXiv:2310.08584,

    Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M Asano, and Yannis Avrithis. Is ImageNet Worth 1 Video? Learning Strong Image Encoders from 1 Long Unlabelled Video. arXiv preprint arXiv:2310.08584,

  67. [67]

    Covr: Learning composed video retrieval from web video captions

    Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. Covr: Learning composed video retrieval from web video captions. In AAAI, 2024. 2, 3

  68. [68]

    Covr-2: Automatic data construction for composed video retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

    Lucas Ventura, Antoine Yang, Cordelia Schmid, and G ¨ul Varol. Covr-2: Automatic data construction for composed video retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3

  69. [69]

    Tarsier: Recipes for training and evaluating large video description models.arXiv preprint arXiv:2407.00634, 2024

    Jiawei Wang, Liping Yuan, Yuchen Zhang, and Haomiao Sun. Tarsier: Recipes for training and evaluating large video description models.arXiv preprint arXiv:2407.00634, 2024. 2, 5

  70. [70]

    Actionclip: A new paradigm for video action recognition

    Mengmeng Wang, Jiazheng Xing, and Yong Liu. Actionclip: A new paradigm for video action recognition.arXiv preprint arXiv:2109.08472, 2021. 2

  71. [71]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 6

  72. [72]

    Vatex: A large-scale, high-quality multilingual dataset for video-and-language research

    Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019. 8

  73. [73]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation.arXiv preprint arXiv:2307.06942, 2023. 2, 6

  74. [74]

    Internvideo2: Scaling Foundation Models for Multimodal Video Understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling Foundation Models for Multimodal Video Understanding. In ECCV, 2024. 6

  75. [75]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025. 2

  76. [76]

    Paxion: Patching action knowledge in video-language foundation models. NeurIPS, 36, 2023

    Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, and Heng Ji. Paxion: Patching action knowledge in video-language foundation models. NeurIPS, 36, 2023. 2

  77. [77]

    Learning and Using the Arrow of Time

    Donglai Wei, Joseph J Lim, Andrew Zisserman, and William T Freeman. Learning and Using the Arrow of Time. InCVPR, 2018. 2

  78. [78]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InCVPR, 2021. 2

  79. [79]

    Videoclip: Contrastive pre-training for zero-shot video-text understanding

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021. 2

  80. [80]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, 2016. 1, 2, 8

Showing first 80 references.