pith. machine review for the scientific record.

arxiv: 2604.22280 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal embeddings · generative embeddings · rewrite method · chain-of-thought · multimodal retrieval · large language models · reinforcement learning

The pith

Rewrite replaces chain-of-thought to create stronger generative multimodal embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing chain-of-thought reasoning with a retrieval-friendly rewrite in multimodal large language models to produce better embeddings. This unified approach jointly trains generation and embedding tasks to cut redundant steps and reduce ambiguity during retrieval. Additional components align generative and discriminative embedding spaces for flexible trade-offs and use reinforcement learning anchored by stable embeddings to refine the process. Experiments on multiple benchmarks show gains in performance alongside shorter reasoning traces. If correct, this would make generative embeddings more practical for real-world multimodal retrieval without the length or clarity costs of standard reasoning chains.
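As a rough illustration of the interface shift described above, the sketch below contrasts a CoT-then-embed pipeline with a rewrite-then-embed one: in both cases the embedding is taken from the text the model just produced, but the rewrite is short and retrieval-oriented rather than a long reasoning trace. The `mllm` wrapper, its `generate`/`encode` methods, and the prompts are hypothetical stand-ins, not the paper's actual API.

```python
# Minimal sketch (not the paper's code) of CoT-then-embed vs rewrite-then-embed.
# `mllm` is a hypothetical multimodal LLM wrapper with generate()/encode() methods.
from dataclasses import dataclass

@dataclass
class Embedded:
    text: str      # the generated trace or rewrite
    vector: list   # pooled hidden state used as the embedding

def cot_then_embed(mllm, image, query):
    # long reasoning trace, then embed conditioned on it
    trace = mllm.generate(image, query, prompt="Think step by step, then answer.")
    return Embedded(text=trace, vector=mllm.encode(image, query, trace))

def rewrite_then_embed(mllm, image, query):
    # short, retrieval-friendly rewrite acts as the interface for the embedding
    rewrite = mllm.generate(
        image, query,
        prompt="Rewrite the input as a short, self-contained retrieval query.")
    return Embedded(text=rewrite, vector=mllm.encode(image, query, rewrite))
```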

Core claim

RIME is a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. It is bridged by Cross-Mode Alignment, which enables mutual retrieval between the generative and discriminative spaces, and guided by Refine-RL, which anchors optimization to stable discriminative embeddings. The claimed result is substantial outperformance over prior generative models on MMEB-V2, MRMR, and UVRB while shortening the length of thinking.
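One way to read the Cross-Mode Alignment component is as an objective that pulls the generative-mode and discriminative-mode embeddings of the same item together, so a query encoded in either mode can retrieve against an index built in the other. The symmetric InfoNCE form and the temperature below are illustrative assumptions, not the paper's exact loss.

```python
# A hedged sketch of a cross-mode alignment objective between the two spaces.
import torch
import torch.nn.functional as F

def cma_alignment_loss(gen_emb: torch.Tensor, disc_emb: torch.Tensor,
                       temperature: float = 0.05) -> torch.Tensor:
    """gen_emb, disc_emb: (batch, dim) embeddings of the same batch of items,
    produced in generative mode (after the rewrite) and discriminative mode."""
    g = F.normalize(gen_emb, dim=-1)
    d = F.normalize(disc_emb, dim=-1)
    logits = g @ d.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(g.size(0), device=g.device)    # positives on the diagonal
    # symmetric: generative -> discriminative and discriminative -> generative
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```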

What carries the argument

The retrieval-friendly rewrite step that acts as the central interface to jointly optimize generative and embedding objectives while preserving semantic content for downstream tasks.
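A minimal sketch of what jointly optimizing generative and embedding objectives through one rewrite could look like: the same forward pass is supervised with token-level cross-entropy on the rewrite and a contrastive retrieval loss on the embedding taken from it. The weighting and loss forms are assumptions for illustration, not the paper's training recipe.

```python
# Illustrative joint objective: generation term + in-batch contrastive embedding term.
import torch
import torch.nn.functional as F

def joint_sft_loss(rewrite_logits, rewrite_labels, query_emb, target_emb,
                   emb_weight: float = 1.0, temperature: float = 0.05):
    # generation term: teacher-forced cross-entropy on the rewrite tokens
    gen_loss = F.cross_entropy(
        rewrite_logits.reshape(-1, rewrite_logits.size(-1)),
        rewrite_labels.reshape(-1),
        ignore_index=-100,
    )
    # embedding term: in-batch contrastive loss, positives on the diagonal
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.t() / temperature
    emb_loss = F.cross_entropy(logits, torch.arange(q.size(0), device=q.device))
    return gen_loss + emb_weight * emb_loss
```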

If this is right

  • Generative embeddings become usable in broader retrieval scenarios without introducing semantic ambiguity from long reasoning.
  • Models gain the ability to trade off efficiency against accuracy through alignment of the two embedding spaces.
  • Reinforcement learning for generation gains stability by treating discriminative embeddings as fixed semantic anchors (see the sketch after this list).
  • Shorter reasoning traces lower computational cost while sustaining or increasing task accuracy on embedding benchmarks.
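A hedged sketch of the anchoring idea in the third bullet: the reward for a sampled rewrite is the similarity of its embedding to a frozen discriminative embedding of the same input, optionally plus a brevity bonus. The reward shape, the length term, and the constants are assumptions, not the paper's Refine-RL formulation.

```python
# Sketch of an anchor-based RL reward: stay close to a frozen discriminative
# embedding while keeping the rewrite short.
import torch
import torch.nn.functional as F

@torch.no_grad()
def refine_rl_reward(rewrite_emb: torch.Tensor,
                     anchor_emb: torch.Tensor,
                     rewrite_len: int,
                     max_len: int = 64,
                     length_weight: float = 0.1) -> torch.Tensor:
    """rewrite_emb: embedding of the sampled rewrite, shape (dim,);
    anchor_emb: frozen discriminative embedding of the same input, shape (dim,)."""
    semantic = F.cosine_similarity(rewrite_emb, anchor_emb, dim=-1)
    brevity = max(0.0, 1.0 - rewrite_len / max_len)   # shorter rewrites score higher
    return semantic + length_weight * brevity
```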

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rewrite interface might extend beyond embeddings to other generative multimodal tasks such as captioning or question answering.
  • Further tests on zero-shot or cross-modal retrieval could show whether the gains hold when query distributions shift from the training benchmarks.
  • Integrating the rewrite with additional length penalties might yield even more compact outputs without further performance loss.

Load-bearing premise

The rewrite step preserves necessary semantic information for downstream retrieval while remaining retrieval-friendly.

What would settle it

Compare retrieval accuracy on held-out multimodal queries using RIME rewrite outputs versus direct chain-of-thought outputs; a drop below prior generative baselines would falsify the performance claim.
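A sketch of how that comparison could be scored, assuming the benchmark supplies query and corpus embeddings plus relevance labels; `embed_queries`, `embed_corpus`, and `relevant` are hypothetical placeholders, not names from the paper.

```python
# Recall@k harness for comparing rewrite-based vs CoT-based embeddings on a
# held-out query set over the same candidate pool.
import numpy as np

def recall_at_k(query_embs: np.ndarray, corpus_embs: np.ndarray,
                relevant: list, k: int = 10) -> float:
    """relevant[i] is the set of corpus indices relevant to query i."""
    sims = query_embs @ corpus_embs.T            # cosine similarity if rows are unit-norm
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [len(set(row) & rel) > 0 for row, rel in zip(topk, relevant)]
    return float(np.mean(hits))

# usage idea (all names hypothetical):
# r_rewrite = recall_at_k(embed_queries("rewrite"), embed_corpus(), relevant)
# r_cot     = recall_at_k(embed_queries("cot"),     embed_corpus(), relevant)
# a drop of r_rewrite below prior generative baselines would falsify the claim
```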

Figures

Figures reproduced from arXiv: 2604.22280 by Bo Lin, Bosong Chai, Chenxi Zhao, Dacheng Yin, Feipeng Ma, Fengyun Rao, Hebei Li, Jie Chen, Jing Lyu, Junjie Zhou, Ke Mei, Peixi Wu, Shannan Yan, Tianyi Wang, Xiaoyan Sun, Yansong Peng, Zhangchi Hu, Zhibin Lan.

Figure 1: An overview of the RIME pipeline. RIME utilizes rewrite SFT and refine RL to build generative embeddings […]
Figure 2: The complete multimodal rewrite template […]
Figure 3: Overview of the RIME framework. The framework comprises three core components: (a) Rewrite-Driven Joint SFT […]
Figure 4: Text-to-Rewrite (T2R) and Image-to-Rewrite (I2R) prompt example.
Figure 5: Image-Text VQA to Rewrite (IT2R-VQA) and Image-Text Description to Rewrite (IT2R-DESC) prompt example.
Figures 6–10: Examples of data construction.
original abstract

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Rewrite-driven Multimodal Embedding (RIME), a unified framework replacing Chain-of-Thought reasoning with a retrieval-friendly rewrite step for generative multimodal embeddings. It introduces Cross-Mode Alignment (CMA) to bridge generative and discriminative embedding spaces for flexible mutual retrieval and Refine Reinforcement Learning (Refine-RL) that anchors rewrite optimization to stable discriminative embeddings. Experiments on MMEB-V2, MRMR, and UVRB are reported to show substantial outperformance over prior generative embedding models while reducing thinking length.

Significance. If the empirical results and ablations hold, RIME could offer a more efficient interface for generative multimodal embeddings by mitigating CoT redundancy and ambiguity, with CMA enabling trade-offs between efficiency and accuracy. The joint optimization via Refine-RL provides a concrete mechanism for aligning generative outputs to retrieval-friendly semantics.

major comments (2)
  1. [Abstract] Abstract: the claim of substantial outperformance on MMEB-V2, MRMR, and UVRB supplies no quantitative results, baselines, error bars, or ablation details, preventing evaluation of the central empirical claim from the provided text.
  2. [Abstract / §4] Abstract / §4 (Experiments): the assertion that the rewrite step preserves task-relevant semantics while remaining retrieval-friendly is not isolated from CMA alignment or Refine-RL anchoring; no controlled ablation, information-theoretic argument, or formal derivation separates the rewrite's contribution, which is load-bearing for attributing observed gains to the proposed interface rather than auxiliary objectives.
minor comments (1)
  1. [Abstract] Abstract: 'significantly reducing the length of thinking' lacks specific metrics (e.g., token counts or step reductions versus CoT baselines) that would clarify the efficiency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our paper. We have carefully considered each comment and provide point-by-point responses below, along with planned revisions to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of substantial outperformance on MMEB-V2, MRMR, and UVRB supplies no quantitative results, baselines, error bars, or ablation details, preventing evaluation of the central empirical claim from the provided text.

    Authors: We agree with the referee that the abstract lacks specific quantitative details, which would help readers quickly assess the claims. In the revised manuscript, we will update the abstract to include key performance numbers (e.g., average improvements on MMEB-V2, MRMR, and UVRB), mention the main baselines, and note that results include error bars from multiple runs. This addresses the evaluation concern directly. revision: yes

  2. Referee: [Abstract / §4] Abstract / §4 (Experiments): the assertion that the rewrite step preserves task-relevant semantics while remaining retrieval-friendly is not isolated from CMA alignment or Refine-RL anchoring; no controlled ablation, information-theoretic argument, or formal derivation separates the rewrite's contribution, which is load-bearing for attributing observed gains to the proposed interface rather than auxiliary objectives.

    Authors: We appreciate this point regarding the isolation of the rewrite step's contribution. The current experiments in §4 include ablations for the overall framework, but to more precisely separate the rewrite's role from CMA and Refine-RL, we will add a dedicated controlled experiment in the revision. This will involve comparing the full RIME against a variant where the rewrite is replaced by standard generation while retaining the other components. We will also include a short discussion on the semantic preservation properties of the rewrite based on embedding similarity metrics. This will better attribute the gains to the proposed rewrite interface. revision: yes
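For concreteness, a sketch of the controlled comparison this response describes: hold CMA and Refine-RL fixed and swap only the rewrite interface for standard generation, scoring both variants on the same retrieval split. `build_pipeline` and `evaluate_retrieval` are hypothetical placeholders, not the authors' code.

```python
# Illustrative ablation harness isolating the rewrite interface's contribution.
def ablate_rewrite(build_pipeline, evaluate_retrieval, eval_split):
    full = build_pipeline(interface="rewrite", cma=True, refine_rl=True)
    variant = build_pipeline(interface="standard_generation", cma=True, refine_rl=True)
    return {
        "full_rime": evaluate_retrieval(full, eval_split),
        "no_rewrite": evaluate_retrieval(variant, eval_split),
    }
```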

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmarks, not self-referential derivations

full rationale

The paper introduces RIME as a framework combining a retrieval-friendly rewrite interface, Cross-Mode Alignment (CMA), and Refine-RL, but supplies no equations, first-principles derivations, or parameter-fitting steps that reduce to their own inputs. Central performance claims are supported solely by empirical results on MMEB-V2, MRMR, and UVRB; the rewrite's semantic-preservation property is asserted as a design choice rather than derived from prior fitted quantities or self-citations. No load-bearing self-citation chains, ansatz smuggling, or uniqueness theorems appear in the provided text. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities beyond the named framework components.

pith-pipeline@v0.9.0 · 5534 in / 884 out tokens · 30172 ms · 2026-05-08T12:43:13.536423+00:00 · methodology



    Everyday Usage: Sometimes broadly used to describe any system or phenomenon exhibiting a degree of autonomous decision-making. Step 3: Summary "Artificial Intelligence" is a cross-disciplinary concept spanning computer science, mathematics, and cognitive science, whose meaning varies with context and requires interpretation based on the specific scenario....