arxiv: 2509.24621 · v2 · submitted 2025-09-29 · 💻 cs.CV

FreeRet: MLLMs as Training-Free Retrievers

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Chunxu Liu, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang

Pith reviewed 2026-05-18 12:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal retrieval · training-free · MLLMs · embeddings · reranking · RAG · MMEB benchmark · plug-and-play

The pith

Off-the-shelf MLLMs can serve as powerful multimodal retrievers without any training by deriving faithful embeddings for search and using reasoning for reranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether multimodal large language models already possess the capabilities needed for effective retrieval across different data types. It introduces FreeRet as a plug-and-play method that first pulls semantically grounded embeddings directly from the model for quick candidate selection and then applies the model's reasoning for precise reranking. This training-free approach is evaluated on the MMEB and MMEB-V2 benchmarks that together cover 46 datasets, where it beats models that were trained on millions of example pairs. A sympathetic reader would care because the result suggests that complex multimodal retrieval systems could be built using a single pretrained model without separate training stages or loss of its original abilities.

Core claim

FreeRet shows that any off-the-shelf MLLM can function as a two-stage retriever without additional training: it bypasses lexical alignment layers and conditions representation generation on explicit priors to produce semantically faithful embeddings for fast candidate search, then applies neutral choice framing to reduce framing effects while using the model's reasoning for accurate reranking. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, this method substantially outperforms models trained on millions of pairs. The framework is model-agnostic, scales across families and sizes, preserves generative capabilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model.

What carries the argument

The FreeRet two-stage framework that derives semantically grounded embeddings by bypassing lexical alignment and conditioning on priors, followed by reasoning-based reranking with neutral choice framing.
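
The two-stage flow described above can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats, and it stands in for the MLLM's reasoning-based judgment with a hypothetical `rerank_score` callable; the function names and the `top_k` parameter are illustrative choices, not the paper's interface.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def retrieve_then_rerank(query_emb, cand_embs, rerank_score, top_k=3):
    """Stage 1: cosine top-k shortlist. Stage 2: reorder the shortlist.

    rerank_score is a hypothetical callable (candidate index -> float)
    standing in for the MLLM's reasoning-based relevance judgment.
    """
    sims = [cosine(query_emb, c) for c in cand_embs]
    # Stage 1: keep the top_k candidates by embedding similarity.
    shortlist = sorted(range(len(cand_embs)), key=lambda i: -sims[i])[:top_k]
    # Stage 2: reorder only the shortlist with the more expensive scorer.
    reranked = sorted(shortlist, key=lambda i: -rerank_score(i))
    return shortlist, reranked

if __name__ == "__main__":
    query = [1.0, 0.0]
    cands = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    toy_scores = {0: 0.2, 1: 0.9}  # made-up reranker outputs
    print(retrieve_then_rerank(query, cands, toy_scores.get, top_k=2))
```

The reranker only ever sees the shortlist, so a relevant item that the embedding stage misses cannot be recovered in the second stage. That is why the quality of the first-stage embeddings matters independently of the reranker.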

Load-bearing premise

That off-the-shelf MLLMs already contain semantically faithful embeddings and reliable reasoning capabilities that can be directly harnessed for retrieval without any post-hoc training or alignment adjustments.

What would settle it

A new multimodal retrieval benchmark on which FreeRet underperforms models trained on large contrastive datasets, or where removing the reasoning reranking step causes a large drop in accuracy.

Figures

Figures reproduced from arXiv: 2509.24621 by Chenting Wang, Chunxu Liu, Limin Wang, Xiangyu Zeng, Xinhao Li, Yicheng Xu, Yi Wang, Yuhan Zhu, Ziang Yan.

**Figure 1.** Comparison between prior post-training retrievers and our FreeRet. (a) Existing methods rely on extensive data curation and costly fine-tuning to construct separate embedding and reranking modules. (b) FreeRet directly employs MLLMs as unified embedders and rerankers without any extra training. (c) On the MMEB benchmark covering 36 datasets, FreeRet outperforms models trained on millions of pairs and match…

**Figure 2.** Probing experiments on lexicalization pressure. Results for 3B and 32B variants are provided in …

**Figure 3.** Word-level probability visualization for the output “One Word” of different methods. The top-left panel shows the input example (from N24News (Wang et al., 2021)).

Remedy. Building on these findings, we propose a simple yet effective fix: discard the final MLP layer when producing embeddings. This choice retains the high-level abstractions encoded in deeper layers while avoiding the distortion caused by le…
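
One possible reading of the remedy quoted above, shown on a toy final transformer block with residual connections. This is a sketch under stated assumptions: the block layout, the `attn` and `mlp` callables, and the name `last_block_embedding` are hypothetical, and where exactly the paper taps the hidden state may differ.

```python
import math

def add(u, v):
    """Element-wise sum of two vectors (the residual connection)."""
    return [a + b for a, b in zip(u, v)]

def unit(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else list(v)

def last_block_embedding(x, attn, mlp, skip_final_mlp=True):
    """Toy final block (assumed layout, not the paper's exact architecture):

        h   = x + attn(x)   # residual stream after attention
        out = h + mlp(h)    # residual stream after the final MLP

    With skip_final_mlp=True the embedding is read from h, before the MLP
    that prepares the state for next-token prediction.
    """
    h = add(x, attn(x))
    if skip_final_mlp:
        return unit(h)
    return unit(add(h, mlp(h)))

if __name__ == "__main__":
    x = [3.0, 0.0]
    toy_attn = lambda v: [0.0, 4.0]    # made-up attention update
    toy_mlp = lambda v: [-v[0], 0.0]   # made-up MLP update
    print(last_block_embedding(x, toy_attn, toy_mlp, skip_final_mlp=True))
    print(last_block_embedding(x, toy_attn, toy_mlp, skip_final_mlp=False))
```

In the toy numbers, the MLP update changes the direction of the vector, so the two embeddings would rank neighbours differently. This only shows the mechanism and says nothing about which variant retrieves better.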

**Figure 4.** LLM framing effect on benchmark accuracy (left) and inherent lexical biases in context-free response modes (right).

One would expect these to be interchangeable, since each simply encodes a positive/negative decision. However, the model achieved 5.0% lower accuracy with Right/Wrong than with True/False. What drives this sensitivity? We posit it stems from imbalances inherited from pretraining corpora. Wor…

**Figure 5.** Varying the number of reranking candidates.

**Figure 6.** FreeRet enables instant omni-modal retrieval with omni-modality models. Illustrated with Qwen2.5-Omni: audio-to-video retrieval (left); image+text to video retrieval (right).

4.4 DISCUSSIONS ON TRAINING-FREE ADVANTAGES

Instant Deployment. A key strength of the training-free paradigm is its ability to turn any MLLM into a retriever immediately, with no additional fine-tuning. This property allows practition…

**Figure 7.** Qwen2.5-VL 3B and 32B results in probing experiments on lexicalization pressure.

Original abstract

Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FreeRet claims off-the-shelf MLLMs can do strong training-free retrieval on MMEB-scale tasks via layer bypass for embeddings plus reasoning reranking, but the first-stage embedding quality is not isolated enough to fully support the headline gains.

The main thing here is that this paper argues you can skip training entirely and still get competitive multimodal retrieval from any MLLM by pulling embeddings after bypassing lexical alignment layers, conditioning them with explicit priors, and then using neutral framing when the model reranks candidates with its own reasoning. That combination is the concrete new piece, and the work does a reasonable job showing the approach stays model-agnostic across families and sizes while keeping the original generative abilities intact. The unification of retrieval, reranking, and generation inside one model without extra fine-tuning is a practical plus for RAG setups that want to stay lightweight. The reported results on MMEB and MMEB-V2 across 46 datasets claim clear outperformance over models trained on millions of pairs, which would matter if the numbers are solid.

The soft spots sit mostly in the experimental grounding. The stress-test concern lands: there is no clear separate measurement of how well the bypassed-layer embeddings alone rank relevant items by cosine similarity before reranking kicks in. Without that isolation, it is hard to tell whether the first stage is genuinely producing a useful metric space or whether the reranking step is carrying most of the load on a weaker candidate set. The abstract also stays light on exact baselines, error bars, and implementation choices, so the full paper needs to supply those details to make the central claim stick.

This is aimed at people building practical multimodal retrieval or RAG systems who want to avoid contrastive training costs. Readers who care about generalist models and lower compute would find the ideas and the scale of the evaluation useful. The work shows clear enough thinking on the problem and honest engagement with existing MLLM capabilities to deserve a serious referee, even with the gaps in isolating the embedding step.

I would send it for peer review to get direct feedback on tightening the first-stage verification and confirming the numbers hold under closer scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces FreeRet, a plug-and-play, training-free framework that converts any off-the-shelf MLLM into a two-stage retriever. The first stage derives embeddings for fast candidate search by bypassing lexical alignment layers and conditioning on explicit priors; the second stage uses the MLLM's reasoning for reranking with neutral choice framing to mitigate framing effects. The approach is presented as model-agnostic, modality-flexible, and capable of unifying retrieval, reranking, and generation in a single model. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet is claimed to substantially outperform models trained on millions of pairs while preserving generative capabilities.

Significance. If the results hold under rigorous verification, the work would be significant for showing that pretrained MLLMs already encode retrieval-friendly representations that can be directly harnessed without contrastive fine-tuning. This could reduce the need for separate retrieval-specific training pipelines and support end-to-end RAG systems within unified multimodal models, with potential impact on generalist AI architectures.

major comments (2)

[Experimental Evaluation] Experimental section: The headline claim of substantial outperformance on MMEB/MMEB-V2 lacks reported details on exact baselines (including their training data volume and architectures), statistical significance tests, error bars, or ablation on the contribution of each component (bypassing layers vs. priors vs. reranking). Without these, it is difficult to isolate whether gains stem from the proposed method or from implementation choices.
[Embedding Derivation] Section describing the embedding stage: The assumption that bypassing lexical alignment layers produces embeddings whose cosine similarities reliably rank semantic relevance is load-bearing for the first-stage recall. No independent zero-shot retrieval metrics (e.g., recall@K on a held-out subset prior to reranking) are provided to validate embedding quality, leaving open the possibility that the reranker is compensating for a weak candidate pool.
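
The recall@K check requested in the second major comment can be computed from first-stage rankings alone. The sketch below uses the standard definition of recall@K; the function names and the toy identifiers are illustrative and are not taken from the paper.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k of one ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must be non-empty")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(rankings, relevants, k):
    """Average recall@k over queries (one ranking and one relevant set each)."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Toy first-stage rankings for two queries, before any reranking.
    rankings = [["a", "b", "c"], ["x", "y", "z"]]
    relevants = [{"b"}, {"z"}]
    for k in (1, 2, 3):
        print(k, mean_recall_at_k(rankings, relevants, k))
```

Reporting this at the shortlist size used for reranking gives an upper bound on what the second stage can achieve, which speaks directly to the concern about a weak candidate pool.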

minor comments (2)

[Abstract] Abstract and introduction: Quantify the claimed 'substantial' improvements with specific metrics or relative gains rather than qualitative language.
[Method] Clarify the precise formulation of 'explicit priors' and 'neutral choice framing' with pseudocode or a small example to aid reproducibility.
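
As an illustration of the kind of small example the last comment asks for, here is one hypothetical way to reduce label bias in a yes/no style relevance judgment. The answer labels are neutral tokens, their assignment is swapped, and the two margins are averaged. Everything here is an assumption made for illustration: the prompt wording, the `A`/`B` labels, and the `label_logprob` callable (prompt, label -> log-probability) are not the paper's formulation.

```python
def build_prompt(query, candidate, pos_label, neg_label):
    """Compose a relevance question whose answer options are neutral labels."""
    return (
        f"Query: {query}\nCandidate: {candidate}\n"
        f"Does the candidate match the query? "
        f"Answer {pos_label} if it matches, {neg_label} if it does not."
    )

def debiased_score(query, candidate, label_logprob, labels=("A", "B")):
    """Average the positive-minus-negative margin over both label assignments.

    label_logprob is a hypothetical callable returning the model's
    log-probability of emitting `label` after `prompt`.
    """
    margins = []
    for pos, neg in (labels, labels[::-1]):
        prompt = build_prompt(query, candidate, pos, neg)
        margins.append(label_logprob(prompt, pos) - label_logprob(prompt, neg))
    return sum(margins) / len(margins)

if __name__ == "__main__":
    def fake_logprob(prompt, label):
        # Made-up model: +2.0 for the correct answer, +1.0 bias toward "A".
        pos = prompt.split("Answer ")[1].split(" ")[0]
        return (2.0 if label == pos else 0.0) + (1.0 if label == "A" else 0.0)

    print(debiased_score("red car", "photo of a red car", fake_logprob))
```

In the toy model the constant preference for "A" adds to one margin and subtracts from the other, so the average recovers the underlying signal of 2.0.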

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experimental details and validations as suggested.

Referee: [Experimental Evaluation] Experimental section: The headline claim of substantial outperformance on MMEB/MMEB-V2 lacks reported details on exact baselines (including their training data volume and architectures), statistical significance tests, error bars, or ablation on the contribution of each component (bypassing layers vs. priors vs. reranking). Without these, it is difficult to isolate whether gains stem from the proposed method or from implementation choices.

Authors: We agree that these details strengthen the presentation. In the revised manuscript, we have added a table specifying all baselines with their exact architectures and training data volumes. We now report results with error bars computed over three independent runs and include p-values from paired statistical significance tests against the strongest baselines. We have also expanded the ablation study to isolate the contributions of bypassing lexical alignment layers, explicit priors, and the reranking stage separately.
revision: yes

Referee: [Embedding Derivation] Section describing the embedding stage: The assumption that bypassing lexical alignment layers produces embeddings whose cosine similarities reliably rank semantic relevance is load-bearing for the first-stage recall. No independent zero-shot retrieval metrics (e.g., recall@K on a held-out subset prior to reranking) are provided to validate embedding quality, leaving open the possibility that the reranker is compensating for a weak candidate pool.

Authors: We acknowledge this concern. The revised manuscript now includes independent zero-shot retrieval metrics (recall@K at multiple K values) computed on held-out subsets using only the first-stage embeddings, prior to reranking. These results show that the embeddings achieve competitive initial recall, confirming that the reranker operates on a reasonably strong candidate pool rather than compensating for deficiencies in the embedding stage.
revision: yes
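
The error bars and paired significance testing mentioned in the first response could be computed along the following lines. The rebuttal does not name a specific test, so this sketch uses an exact paired sign-flip permutation test as one reasonable choice; the function names and all scores are made up for illustration.

```python
import itertools
import statistics

def mean_and_std(run_scores):
    """Mean and sample standard deviation across independent runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)

def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each per-dataset difference is equally likely
    to have either sign, so every sign pattern is enumerated. The cost is
    2**n, which is only practical for a small number of pairs.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = extreme = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total

if __name__ == "__main__":
    print(mean_and_std([61.0, 62.0, 63.0]))   # three made-up runs
    method = [7, 8, 9, 6]                     # made-up per-dataset scores
    baseline = [6, 7, 8, 5]
    print(paired_sign_flip_pvalue(method, baseline))
```

With only four paired datasets the smallest achievable two-sided p-value is 2/16, which shows why the number of datasets limits how strong a significance claim can be. For dozens of datasets, sampling random sign patterns instead of enumerating them is the usual workaround.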

Circularity Check

0 steps flagged

No significant circularity; empirical framework validated on benchmarks

The paper presents FreeRet as a plug-and-play, training-free method that derives embeddings by bypassing lexical alignment layers in off-the-shelf MLLMs and uses the model's reasoning for reranking. All central claims of outperformance are grounded in direct experimental results on the MMEB and MMEB-V2 benchmarks spanning 46 datasets, rather than any mathematical derivations, predictions, or first-principles results that reduce to the inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked in a load-bearing way that would create circularity. The approach is model-agnostic and empirically falsifiable, making the reported findings self-contained without tautological reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rely on capabilities already present in pretrained MLLMs.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: bypassing the final MLP before the LM head... Removing it yields embeddings that better capture underlying meaning

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: On the MMEB and MMEB-V2 benchmarks... FreeRet substantially outperforms models trained on millions of pairs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adapting MLLMs for Nuanced Video Retrieval
cs.CV 2025-12 unverdicted novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
