VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Pith reviewed 2026-05-18 14:04 UTC · model grok-4.3
The pith
VLM2Vec-V2 trains a single embedding model that handles text, images, videos, and visual documents while improving results on both new and existing benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLM2Vec-V2 is a general-purpose embedding model that supports text, image, video, and visual document inputs and achieves strong performance on newly introduced video and document retrieval tasks while also improving over prior baselines on the original image benchmarks.
What carries the argument
The unified framework that trains one model on a mixed data regime covering text, image, video, and visual document inputs to produce embeddings usable across all four modalities.
If this is right
- Multimodal search and recommendation systems can use one embedding space instead of separate models for images, videos, and documents.
- Retrieval-augmented generation pipelines gain access to video and visual document sources through the same embedding mechanism.
- AI agents can retrieve and reason over mixed visual inputs without switching between modality-specific embedders.
- Future representation learning research can build on the observed generalizability patterns across the new benchmark tasks.
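To make the first consequence concrete, here is a minimal sketch of retrieval over a single shared embedding space. It is not the paper's implementation: the vectors are hand-written toy values standing in for model outputs, and the names `cosine`, `search`, and the item identifiers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, k=2):
    """Rank every indexed item, whatever its modality, by similarity to the query."""
    scored = [(cosine(query_vec, vec), item_id, modality)
              for item_id, modality, vec in index]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(item_id, modality) for _, item_id, modality in scored[:k]]

# Toy vectors (assumed for illustration, not real model outputs).
index = [
    ("img_001", "image", [0.9, 0.1, 0.0]),
    ("vid_042", "video", [0.8, 0.2, 0.1]),
    ("doc_007", "visual_document", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]
results = search(query, index)
```

Because all items live in one space, a single ranking call can return an image and a video side by side; with separate per-modality embedders, the similarity scores would not be directly comparable.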
Where Pith is reading between the lines
- If unified training continues to lift image performance as a side effect, practitioners may prefer single models over ensembles even when only images are needed.
- The benchmark expansion suggests similar extensions could be made for audio or 3D data to test whether the same training recipe scales further.
- Effective strategies identified for unified embedding learning might reduce the cost of maintaining separate modality-specific systems in production.
Load-bearing premise
The training procedure and data mixture used for VLM2Vec-V2 will generalize across video and visual-document tasks without requiring modality-specific architectural changes or extra tuning.
What would settle it
A held-out set of video retrieval or visual document tasks where VLM2Vec-V2 scores lower than existing specialized models for those modalities would show the unified approach does not generalize as claimed.
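The check described above can be sketched as a per-task comparison. This is an illustration under stated assumptions: the function name, task names, and all scores are hypothetical placeholders and are not results from the paper.

```python
def tasks_where_specialist_wins(unified, specialist, margin=0.0):
    """Return held-out tasks where the specialist beats the unified model by more than margin."""
    return sorted(
        task for task in unified
        if task in specialist and specialist[task] - unified[task] > margin
    )

# Placeholder scores for illustration only; NOT results from the paper.
unified_scores = {"video_retrieval_heldout": 0.50, "visdoc_retrieval_heldout": 0.70}
specialist_scores = {"video_retrieval_heldout": 0.60, "visdoc_retrieval_heldout": 0.65}

losing = tasks_where_specialist_wins(unified_scores, specialist_scores)
```

A non-empty result on genuinely held-out tasks would count as evidence against the generalization premise. The `margin` parameter lets the comparison ignore differences smaller than the expected run-to-run noise.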
Original abstract
Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VLM2Vec-V2, a unified framework for learning embeddings across text, images, videos, and visual documents. It presents MMEB-V2, an extended benchmark including new tasks for visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. The authors report that VLM2Vec-V2 achieves strong performance on the new video and document tasks while also improving upon prior baselines on the original image benchmarks from MMEB.
Significance. If the reported improvements on image benchmarks can be attributed to the unified training procedure and data mixture rather than simply increased training data or a stronger backbone, this work would represent a meaningful advance in multimodal embedding models. It addresses a practical gap in supporting diverse visual modalities for applications such as retrieval-augmented generation and AI agents, and the new benchmark could facilitate future research in this area.
major comments (2)
- [Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.
- [Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.
minor comments (2)
- Add standard deviations or results from multiple random seeds to all reported metrics in the main results tables to allow assessment of statistical reliability.
- [Benchmark] Clarify the exact definition and construction of the five new task types in MMEB-V2 (e.g., how temporal grounding is formulated as a retrieval task) in the benchmark description section.
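The first minor comment asks for variability across random seeds. A minimal sketch of that reporting follows, assuming per-seed scores for one metric are already available; the function name and the numbers are hypothetical placeholders, not values from the paper.

```python
import statistics

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for one metric across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder per-seed scores; NOT results from the paper.
seed_scores = [60.0, 62.0, 64.0]
mean, std = summarize_seeds(seed_scores)
report = f"{mean:.1f} ± {std:.1f}"
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when only a handful of seeds are run.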
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised regarding attribution of gains and experimental controls.
- Referee: [Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will insert the key quantitative improvements on the original MMEB image tasks (average gains relative to prior baselines), a concise statement on the scale and composition of the image-text portion of the training mixture, and the backbone and approximate compute used. While the full training details and per-task numbers already appear in the Experiments section, we accept that moving a summary of these facts into the abstract will make the attribution claim more transparent and will reduce the possibility that readers attribute gains solely to data volume.
  revision: yes
- Referee: [Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.
  Authors: We acknowledge that the current experiments do not contain an explicit image-only ablation with matched data volume, which would more cleanly isolate the effect of the joint training procedure. We will add this controlled comparison in the revised manuscript: a model trained on an image-only subset whose volume matches the image portion of the full mixture, evaluated on the same legacy MMEB image tasks. This ablation will be reported alongside the existing results to demonstrate whether the observed image gains arise from the unified multi-modal mixture or from data scaling alone. We believe the addition will directly address the referee’s concern while preserving the paper’s central narrative.
  revision: yes
Circularity Check
No circularity: purely empirical training and benchmarking pipeline
The paper introduces MMEB-V2 benchmark and trains VLM2Vec-V2 on a data mixture, then reports measured performance on held-out retrieval, classification, and QA tasks. No mathematical derivation, first-principles prediction, or fitted parameter is presented whose output is forced by construction to equal its own inputs. All claims are externally falsifiable via standard benchmark comparisons and do not rely on self-citation chains or ansatzes that smuggle in the target result. This is a standard empirical ML paper whose central results stand or fall on the reported numbers rather than internal definitional equivalence.
Forward citations
Cited by 20 Pith papers
- When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
  VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
- Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
  Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
- AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
  AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
  MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
- CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
  CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.
- CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
  CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
- Bottleneck Tokens for Unified Multimodal Retrieval
  Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
- Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
  ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
- PLUME: Latent Reasoning Based Universal Multimodal Embedding
  PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
- LMEB: Long-horizon Memory Embedding Benchmark
  LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
- Adapting MLLMs for Nuanced Video Retrieval
  Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
  A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
- Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
  Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
- SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
  SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
- CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
  CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
- FreeRet: MLLMs as Training-Free Retrievers
  FreeRet enables pretrained MLLMs to act as training-free retrievers via semantically grounded embeddings and reasoning-based reranking, outperforming models trained on millions of pairs on MMEB benchmarks.
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
  MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
- Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
  Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.