arxiv: 2507.04590 · v1 · pith:BP2MKXXLnew · submitted 2025-07-07 · 💻 cs.CV · cs.CL

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz


Pith reviewed 2026-05-18 14:04 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords multimodal embeddings · video retrieval · visual document retrieval · unified embedding learning · MMEB-V2 benchmark · temporal grounding · retrieval-augmented generation


The pith

VLM2Vec-V2 trains a single embedding model that handles text, images, videos, and visual documents while improving results on both new and existing benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to extend multimodal embedding models beyond natural images to cover videos and visual documents. Current models like the original VLM2Vec focus narrowly on images, which limits their usefulness for real applications such as AI agents, multimodal search, and retrieval-augmented generation. To address this, the authors create the MMEB-V2 benchmark with five new task types spanning visual documents and videos, then train VLM2Vec-V2 as a unified model on mixed data. Experiments show the model performs well on the new retrieval and grounding tasks while also beating prior baselines on the original image benchmarks.

Core claim

VLM2Vec-V2 is a general-purpose embedding model that supports text, image, video, and visual document inputs and achieves strong performance on newly introduced video and document retrieval tasks while also improving over prior baselines on the original image benchmarks.

What carries the argument

The unified framework that trains one model on a mixed data regime covering text, image, video, and visual document inputs to produce embeddings usable across all four modalities.

If this is right

Multimodal search and recommendation systems can use one embedding space instead of separate models for images, videos, and documents.
Retrieval-augmented generation pipelines gain access to video and visual document sources through the same embedding mechanism.
AI agents can retrieve and reason over mixed visual inputs without switching between modality-specific embedders.
Future representation learning research can build on the observed generalizability patterns across the new benchmark tasks.
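
As a minimal sketch of what a single embedding space offers in practice, the snippet below ranks candidates of different modalities against one text query by cosine similarity. The vectors are hand-written toy values, not outputs of VLM2Vec-V2, and `cosine` and `rank_candidates` are hypothetical helper names introduced only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidates):
    """Return candidate ids sorted by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates.items()]
    scored.sort(reverse=True)
    return [cid for _, cid in scored]

# Toy embeddings (assumed values). With a unified model, one encoder would
# produce all of these vectors regardless of the input modality.
candidates = {
    "image:cat_photo": [0.9, 0.1, 0.0],
    "video:cooking_clip": [0.1, 0.9, 0.2],
    "document:invoice_page": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]  # stands in for the embedding of a text query

ranking = rank_candidates(query, candidates)
print(ranking)
```

Because every candidate lives in the same vector space, one index and one similarity function serve all three modalities; with separate modality-specific embedders, scores from different models would not be directly comparable.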

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If unified training continues to lift image performance as a side effect, practitioners may prefer single models over ensembles even when only images are needed.
The benchmark expansion suggests similar extensions could be made for audio or 3D data to test whether the same training recipe scales further.
Effective strategies identified for unified embedding learning might reduce the cost of maintaining separate modality-specific systems in production.

Load-bearing premise

The training procedure and data mixture used for VLM2Vec-V2 will generalize across video and visual-document tasks without requiring modality-specific architectural changes or extra tuning.

What would settle it

A held-out set of video retrieval or visual document tasks where VLM2Vec-V2 scores lower than existing specialized models for those modalities would show the unified approach does not generalize as claimed.

Original abstract

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VLM2Vec-V2 adds video and document tasks via MMEB-V2 and claims image gains too, but those gains may just reflect more training data rather than the unified framework.


The main things to know about this paper are that it adds support for videos and visual documents to the VLM2Vec line with a new benchmark called MMEB-V2, and it reports gains on the original image tasks as well. The unified embedding idea is not brand new, but applying it across these modalities is a reasonable next step.

What stands out is the creation of MMEB-V2. It includes five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. This covers a broader range of inputs than the first version. They then train VLM2Vec-V2 to handle text, images, videos, and documents in one model. The abstract claims strong performance on the new tasks and better results than baselines on the old image benchmarks. This could be helpful for applications like retrieval-augmented generation or AI agents that deal with mixed media. They do a decent job laying out the motivation and describing the benchmark construction. Extending embeddings this way makes sense for real-world use cases where content isn't just static images.

The soft spots are around the evidence. The abstract mentions performance gains but doesn't include any numbers, training details, or ablation studies. That makes it hard to judge if the improvements on image tasks come from the proposed unified framework and data mixture or simply from using more image data or a better base model. The concern about attribution is fair here. If the paper doesn't show controlled experiments that isolate the effect of adding video and document data, then the story about a general unified approach loses some strength. Also, since this is an extension of prior work, the citation pattern seems to build on existing literature without major gaps, but we'd want to see how they compare to other recent multimodal embedders.

This paper is aimed at researchers in multimodal learning and retrieval. Someone working on building embedding models for diverse data types would find the benchmark useful and might want to try the model. It deserves a serious referee because the new benchmark and the attempt at unification are concrete contributions that could move the subfield forward, even if the claims need more backing in the full version. I'd recommend sending it out for peer review, but with notes to the authors about providing quantitative results and ablations to support the attribution of gains.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VLM2Vec-V2, a unified framework for learning embeddings across text, images, videos, and visual documents. It presents MMEB-V2, an extended benchmark including new tasks for visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. The authors report that VLM2Vec-V2 achieves strong performance on the new video and document tasks while also improving upon prior baselines on the original image benchmarks from MMEB.

Significance. If the reported improvements on image benchmarks can be attributed to the unified training procedure and data mixture rather than simply increased training data or a stronger backbone, this work would represent a meaningful advance in multimodal embedding models. It addresses a practical gap in supporting diverse visual modalities for applications such as retrieval-augmented generation and AI agents, and the new benchmark could facilitate future research in this area.

major comments (2)

[Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.
[Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.

minor comments (2)

Add standard deviations or results from multiple random seeds to all reported metrics in the main results tables to allow assessment of statistical reliability.
[Benchmark] Clarify the exact definition and construction of the five new task types in MMEB-V2 (e.g., how temporal grounding is formulated as a retrieval task) in the benchmark description section.
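
The first minor comment asks for standard deviations over multiple random seeds. A minimal sketch of that reporting is shown below; the scores are placeholders rather than results from the paper, and `summarize_over_seeds` is a hypothetical helper name.

```python
import statistics

def summarize_over_seeds(scores_by_seed):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    values = list(scores_by_seed.values())
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, n - 1 in the denominator
    return mean, std

# Placeholder scores keyed by random seed (assumed values for illustration).
scores = {0: 60.0, 1: 62.0, 2: 64.0}
mean, std = summarize_over_seeds(scores)
print(f"{mean:.1f} ± {std:.1f}")  # prints "62.0 ± 2.0"
```

With only a handful of seeds, the sample standard deviation (`statistics.stdev`) is usually the safer choice over the population form (`statistics.pstdev`), which understates the spread.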

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised regarding attribution of gains and experimental controls.


Referee: [Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.

Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will insert the key quantitative improvements on the original MMEB image tasks (average gains relative to prior baselines), a concise statement on the scale and composition of the image-text portion of the training mixture, and the backbone and approximate compute used. While the full training details and per-task numbers already appear in the Experiments section, we accept that moving a summary of these facts into the abstract will make the attribution claim more transparent and will reduce the possibility that readers attribute gains solely to data volume. revision: yes
Referee: [Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.

Authors: We acknowledge that the current experiments do not contain an explicit image-only ablation with matched data volume, which would more cleanly isolate the effect of the joint training procedure. We will add this controlled comparison in the revised manuscript: a model trained on an image-only subset whose volume matches the image portion of the full mixture, evaluated on the same legacy MMEB image tasks. This ablation will be reported alongside the existing results to demonstrate whether the observed image gains arise from the unified multi-modal mixture or from data scaling alone. We believe the addition will directly address the referee’s concern while preserving the paper’s central narrative. revision: yes
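
The ablation promised in this response can be sketched as follows, assuming each training example carries a modality tag. The record layout, the tag values, and the function name `build_ablation_arms` are assumptions made for illustration; the paper's actual data format is not described in this review.

```python
from collections import Counter

def build_ablation_arms(mixture):
    """Split a training mixture into the two arms of the ablation.

    full_arm:  every example (image + video + visual document).
    image_arm: only the image examples, so its size equals the image
               portion of the full mixture, as the response describes.
    """
    full_arm = list(mixture)
    image_arm = [ex for ex in mixture if ex["modality"] == "image"]
    return full_arm, image_arm

# Hypothetical mixture: 6 image, 3 video, and 1 visual-document example.
mixture = (
    [{"id": f"img{i}", "modality": "image"} for i in range(6)]
    + [{"id": f"vid{i}", "modality": "video"} for i in range(3)]
    + [{"id": "doc0", "modality": "visdoc"}]
)

full_arm, image_arm = build_ablation_arms(mixture)
full_counts = Counter(ex["modality"] for ex in full_arm)
print(len(full_arm), len(image_arm), full_counts["image"])  # prints "10 6 6"
```

Under this design both arms see the same image examples, so a gap on the legacy image tasks would point to the added video and document data rather than to extra image data. The two arms still differ in total example count, so the number of training steps is a further variable the comparison would need to hold fixed or report.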

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmarking pipeline


The paper introduces MMEB-V2 benchmark and trains VLM2Vec-V2 on a data mixture, then reports measured performance on held-out retrieval, classification, and QA tasks. No mathematical derivation, first-principles prediction, or fitted parameter is presented whose output is forced by construction to equal its own inputs. All claims are externally falsifiable via standard benchmark comparisons and do not rely on self-citation chains or ansatzes that smuggle in the target result. This is a standard empirical ML paper whose central results stand or fall on the reported numbers rather than internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Training of a large multimodal model implicitly involves many hyperparameters and data choices that are not enumerated here.

pith-pipeline@v0.9.0 · 5832 in / 1081 out tokens · 30517 ms · 2026-05-18T14:04:21.187924+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV 2026-04 unverdicted novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
cs.CV 2026-05 unverdicted novelty 7.0

AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR 2026-04 unverdicted novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
cs.SE 2026-04 unverdicted novelty 7.0

CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.
CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
cs.CV 2026-04 unverdicted novelty 7.0

CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
PLUME: Latent Reasoning Based Universal Multimodal Embedding
cs.CV 2026-04 unverdicted novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
LMEB: Long-horizon Memory Embedding Benchmark
cs.CL 2026-03 unverdicted novelty 7.0

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
Adapting MLLMs for Nuanced Video Retrieval
cs.CV 2025-12 unverdicted novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
cs.CV 2026-04 unverdicted novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
cs.CV 2026-04 unverdicted novelty 6.0

Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding
cs.CV 2026-04 unverdicted novelty 6.0

SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
cs.CL 2026-01 unverdicted novelty 6.0

CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
FreeRet: MLLMs as Training-Free Retrievers
cs.CV 2025-09 unverdicted novelty 6.0

FreeRet enables pretrained MLLMs to act as training-free retrievers via semantically grounded embeddings and reasoning-based reranking, outperforming models trained on millions of pairs on MMEB benchmarks.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
cs.IR 2025-09 unverdicted novelty 6.0

MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
cs.CL 2026-01 unverdicted novelty 4.0

Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
