pith. machine review for the scientific record.

arxiv: 2604.02073 · v3 · submitted 2026-04-02 · 💻 cs.CV

Recognition: 2 theorem links


PLUME: Latent Reasoning Based Universal Multimodal Embedding

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords universal multimodal embedding · latent reasoning · chain-of-thought · multimodal retrieval · latent states · inference efficiency · MMEB benchmark · semantic anchor adapter

The pith

Short latent-state rollouts replace explicit chain-of-thought in universal multimodal embedding and deliver over 30x faster inference while improving accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to replace the slow, text-heavy chain-of-thought step used in recent universal multimodal embedding models with a compact sequence of continuous hidden-state computations. It claims this latent rollout, steered by a guided transition adapter and trained through an explicit-to-latent curriculum, preserves the intent-inference benefit of reasoning without forcing rich multimodal evidence through a narrow verbal bottleneck. On the 78-task MMEB-v2 benchmark the resulting model beats strong explicit-CoT baselines while cutting reasoning from hundreds of tokens to fewer than ten latent steps. The approach is presented as especially useful for retrieval over dense, structurally complex evidence such as video or visual documents. If the claim holds, retrieval systems could retain intermediate reasoning power without the latency and token cost of generating rationales at inference time.
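The headline numbers are internally consistent. As a rough sanity check, if each generated CoT token and each latent transition both cost about one backbone forward pass, the token counts alone predict a speedup of the claimed order. The figures below are illustrative assumptions for that arithmetic, not measurements from the paper.

```python
# Back-of-envelope check on the headline speedup (illustrative numbers,
# not values from the paper): each decoded CoT token and each latent
# transition costs roughly one forward pass through the backbone, so the
# speedup is approximately the ratio of the two counts.
cot_tokens = 300       # "hundreds of generated tokens" (assumed midpoint)
latent_steps = 8       # "fewer than 10 latent steps"
speedup = cot_tokens / latent_steps
print(speedup)         # 37.5, consistent with the reported "over 30x"
```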

Core claim

PLUME advances universal multimodal embedding by replacing verbalized chain-of-thought rationales with a short autoregressive rollout of continuous latent states, steered by a semantic-anchor-guided transition adapter that selects different reasoning trajectories under a fixed computation budget. A progressive explicit-to-latent curriculum transfers the reasoning behavior from text generation into hidden-state computation during training, so explicit CoT is eliminated at inference, yielding higher accuracy on MMEB-v2 with over 30 times faster inference.
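The mechanism can be sketched in a few lines. This is a minimal toy reconstruction of the claimed computation, not the paper's implementation: all weights, the anchor set, and the softmax-mixing adapter are stand-in assumptions, and only the control flow (a fixed number of continuous hidden-state transitions instead of token decoding, with the final state used as the embedding) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 8  # hidden size and number of latent steps (paper reports <10)

# Stand-ins for learned parameters: a transition map and semantic anchors.
W_trans = rng.normal(scale=0.1, size=(D, D))   # hypothetical transition weights
anchors = rng.normal(size=(4, D))              # hypothetical semantic anchors

def adapter(h):
    """Anchor-guided steering: mix the anchors by softmax similarity to h."""
    logits = anchors @ h
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ anchors

def latent_rollout(h0, steps=K):
    """Autoregressive rollout of continuous latent states (no token decoding)."""
    h = h0
    trajectory = [h]
    for _ in range(steps):
        h = np.tanh(W_trans @ h + adapter(h))  # one latent transition
        trajectory.append(h)
    return h, trajectory

h0 = rng.normal(size=D)                 # hidden state at the <gen> position
embedding, traj = latent_rollout(h0)
print(len(traj) - 1, embedding.shape)   # 8 latent steps, one D-dim embedding
```

The contrast with explicit CoT is that the loop body never projects back to the vocabulary: every intermediate step stays a dense vector, which is where the claimed avoidance of the verbal bottleneck would come from.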

What carries the argument

The semantic-anchor-guided transition adapter that steers a short autoregressive rollout of continuous latent states along task-appropriate reasoning trajectories under fixed compute.

If this is right

  • Higher average accuracy than explicit-CoT UME baselines on the 78-task MMEB-v2 benchmark.
  • Reasoning cost reduced from hundreds of generated tokens to fewer than 10 latent steps.
  • Over 30 times faster inference at retrieval time.
  • Particular gains in video and visual-document retrieval where evidence is dense and hard to organize verbally.
  • A training curriculum that gradually removes explicit rationales while preserving their benefit in hidden states.
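The curriculum bullet above implies a schedule in which verbalized rationales are gradually withdrawn. The paper does not specify the schedule in this review; the staged linear decay below is one guess at its shape, with all names and numbers illustrative.

```python
def curriculum_mix(step, total_steps, stages=4):
    """Fraction of training examples that still carry a verbalized CoT
    scaffold at a given training step. The staged linear decay is an
    assumed shape, not the paper's exact recipe."""
    stage = min(stages * step // total_steps, stages - 1)
    return 1.0 - stage / (stages - 1)

# Explicit rationales are phased out entirely by the final stage.
schedule = [curriculum_mix(s, 100) for s in (0, 30, 60, 99)]
print(schedule)  # monotonically decreasing from 1.0 to 0.0
```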

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-rollout pattern could be applied to other multimodal tasks that currently rely on generated explanations, such as visual question answering or captioning.
  • Because the latent states remain continuous, they may support finer-grained similarity comparisons than discrete token sequences allow.
  • Scaling the number of latent steps modestly might further close any remaining gap on tasks that benefit from longer explicit reasoning.
  • The curriculum approach offers a general recipe for distilling slow explicit reasoning into fast hidden-state computation in other language-model families.

Load-bearing premise

A short sequence of continuous latent states can encode the same complex query-intent inference that explicit verbalized chain-of-thought produces, without discarding essential multimodal evidence.

What would settle it

On any MMEB-v2 task where explicit-CoT baselines currently lead, replace the latent rollout with an oracle that forces the model to output the same verbal rationales used by the baselines and measure whether accuracy drops below the latent-only result.
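That proposed experiment reduces to a per-task score comparison. A hypothetical harness (all names and scores below are illustrative stand-ins, not results from the paper):

```python
def settle_it(tasks, eval_latent, eval_oracle_cot):
    """Sketch of the proposed decisive test: on tasks where explicit-CoT
    baselines currently lead, compare the latent-only score against the
    same model forced through the baselines' verbal rationales."""
    return {task: eval_oracle_cot(task) - eval_latent(task) for task in tasks}

# Toy stand-in scores: a positive delta would mean the verbal bottleneck
# still helps on that task; zero or negative would support the claim that
# latent rollout suffices.
deltas = settle_it(["ChartQA", "InfographicsVQA"],
                   eval_latent=lambda task: 0.62,
                   eval_oracle_cot=lambda task: 0.60)
print(deltas)
```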

Figures

Figures reproduced from arXiv: 2604.02073 by Chaoyang Zhao, Chenwei He, Haiyun Guo, Jinqiao Wang, Lingxiang Wu, Tianyu Yang, Xiangzhao Hao, Yuheng Jia, Yuxiang Ma.

Figure 1
Figure 1. PLUME achieves a favorable accuracy–efficiency tradeoff on MMEB-v2. The x-axis shows inference throughput on a single H20 GPU and the y-axis shows average MMEB-v2 performance. Code and data are publicly available at https://github.com/haoxiangzhao12138/PLUME. view at source ↗
Figure 2
Figure 2. Comparison of three universal multimodal embedding paradigms. Left: early discriminative UME forms embeddings through … view at source ↗
Figure 3
Figure 3. Overview of PLUME. Starting from a multimodal prefix, PLUME replaces explicit CoT decoding with a compact latent rollout inside the backbone. The bottom panel illustrates the latent rollout process, where the model performs several latent transitions before extracting the final retrieval embedding from the hidden state at <gen>. The top-left panel expands the semantic-anchor-guided transition adapter, which… view at source ↗
Figure 4
Figure 4. Per-task performance comparison on MMEB-v2. view at source ↗
Figure 6
Figure 6. Average cosine similarity between intermediate states… view at source ↗
Figure 5
Figure 5. Activation preferences of specialized experts across im… view at source ↗
Figure 7
Figure 7. Failure cases on ChartQA. Each example shows the input chart and query at the top, and the top retrieved predictions from PLUME at the bottom. The green box marks the ground-truth target, while the preceding orange boxes denote higher-ranked distractors. In Case 1, PLUME fails on a count-over-threshold query, ranking 4 above the correct answer 5. In Case 2, PLUME struggles with multi-step numerical aggregation… view at source ↗
Figure 8
Figure 8. Failure case on InfographicsVQA. The figure shows an infographic-style input, the query, the ground-truth target, and the top retrieved predictions from PLUME. The green box marks the correct target, while the orange boxes denote higher-ranked or nearby distractors. In this example, the target answer is facebook, but PLUME ranks a broader candidate, twitter, facebook, pinterest, above the exact target. It… view at source ↗
Figure 9
Figure 9. Average activation rate of each expert across input… view at source ↗
Figure 10
Figure 10. A case where explicit CoT gets trapped by a misleading textual detail while PLUME recovers the correct target. In this example, explicit CoT over-focuses on the spurious text cue “ARMYJUNK,” which biases the subsequent reasoning trace and leads to an incorrect retrieval result. In contrast, PLUME does not commit to a discrete verbal rationale and still retrieves the correct target half track after latent… view at source ↗
Figure 11
Figure 11. A case where explicit CoT over-compresses the retrieval intent while PLUME preserves the correct target. In this example, explicit CoT summarizes the query into an overly coarse verbal description and is distracted toward an incorrect retrieval result. In contrast, PLUME avoids committing to a discrete textual rationale and successfully retrieves the correct target after latent rollout. view at source ↗
Figure 12
Figure 12. A case where explicit CoT is misled by a surface-level prior while PLUME recovers the correct answer. In this example, explicit CoT over-relies on the visual prior that egg cartons are typically made of cardboard and therefore predicts an incorrect result. In contrast, PLUME avoids being locked into this explicit verbal path and successfully retrieves the correct answer styrofoam after latent rollout. view at source ↗
Figure 13
Figure 13. A case where explicit CoT drifts toward an incorrect action description while PLUME recovers the correct target. In this example, explicit CoT over-compresses the video into a coarse description of two pens on a wooden floor and is distracted toward an incorrect retrieval result. In contrast, PLUME avoids committing to this explicit verbal path and successfully retrieves the correct target putting pen sim… view at source ↗
read the original abstract

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes PLUME, a latent reasoning framework for universal multimodal embedding (UME) that replaces explicit chain-of-thought (CoT) rationales with short autoregressive rollouts of continuous latent states. It introduces a semantic-anchor-guided transition adapter to steer these rollouts under a fixed computation budget and uses a progressive explicit-to-latent curriculum to transfer reasoning behavior from verbalized scaffolds to hidden-state computation during training. On the 78-task MMEB-v2 benchmark, PLUME is claimed to outperform strong explicit-CoT UME baselines while reducing reasoning from hundreds of tokens to fewer than 10 latent steps, yielding over 30x faster inference, with particular advantages for dense-evidence tasks such as video and visual document retrieval.

Significance. If the central performance claims hold under rigorous verification, the work would be significant for practical multimodal retrieval systems by showing that structured latent computation can preserve intermediate reasoning benefits without explicit token generation overhead. It extends existing multimodal models with targeted components and targets a clear efficiency gap in current UME approaches.

major comments (3)
  1. [Methods] Methods section (description of semantic-anchor-guided transition adapter): The claim that the adapter steers the autoregressive latent rollout through equivalent reasoning trajectories to explicit CoT is load-bearing for the headline result, yet no ablation is described that disables the adapter while keeping the same latent rollout length, model backbone, and training budget to isolate its contribution versus architectural side-effects.
  2. [Experiments] Experiments section (MMEB-v2 results and ablations): The outperformance on 78 tasks with <10 latent steps is presented without probing of hidden-state evolution, similarity to verbalized rationales, or isolating ablations on the transition adapter; this leaves open whether gains arise from latent reasoning per se or from training dynamics, especially on dense-evidence tasks.
  3. [Training] Training curriculum subsection: The progressive explicit-to-latent curriculum is described as gradually transferring behavior to hidden-state computation, but no details on the schedule, transition losses, or performance curves during the shift are provided to confirm that explicit CoT is fully eliminated at inference without performance degradation.
minor comments (2)
  1. [Abstract and Experiments] The abstract and main text should include explicit details on the exact data splits, task composition of the 78-task MMEB-v2 benchmark, and baseline implementations for full reproducibility.
  2. [Methods] Notation for latent states, the transition adapter, and the number of latent steps should be formalized with equations to clarify the autoregressive rollout process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Methods] Methods section (description of semantic-anchor-guided transition adapter): The claim that the adapter steers the autoregressive latent rollout through equivalent reasoning trajectories to explicit CoT is load-bearing for the headline result, yet no ablation is described that disables the adapter while keeping the same latent rollout length, model backbone, and training budget to isolate its contribution versus architectural side-effects.

    Authors: We agree that an explicit ablation isolating the transition adapter is necessary to substantiate its role. In the revised manuscript we will add this ablation: the adapter is removed while preserving identical latent rollout length, backbone, and training budget, allowing direct comparison of performance and trajectory alignment with the full model. revision: yes

  2. Referee: [Experiments] Experiments section (MMEB-v2 results and ablations): The outperformance on 78 tasks with <10 latent steps is presented without probing of hidden-state evolution, similarity to verbalized rationales, or isolating ablations on the transition adapter; this leaves open whether gains arise from latent reasoning per se or from training dynamics, especially on dense-evidence tasks.

    Authors: We will expand the experiments section with (i) quantitative and qualitative probes of hidden-state evolution across the latent rollout steps, (ii) similarity metrics between latent trajectories and corresponding explicit CoT rationales, and (iii) the adapter ablation described above. These additions will clarify that the observed gains derive from the latent reasoning mechanism rather than training artifacts, with particular attention to dense-evidence tasks. revision: yes

  3. Referee: [Training] Training curriculum subsection: The progressive explicit-to-latent curriculum is described as gradually transferring behavior to hidden-state computation, but no details on the schedule, transition losses, or performance curves during the shift are provided to confirm that explicit CoT is fully eliminated at inference without performance degradation.

    Authors: We will revise the training curriculum subsection to report the exact schedule (number of epochs per stage and mixing ratios), the formulation of the transition losses, and training curves that track both explicit-CoT and latent-only performance throughout the curriculum. These details will demonstrate that explicit CoT is fully removed at inference while performance is preserved or improved. revision: yes

Circularity Check

0 steps flagged

No circularity: the method is presented as an empirical architectural extension without self-referential derivations.

full rationale

The paper introduces PLUME as a latent-reasoning replacement for explicit CoT in UME, using a semantic-anchor-guided transition adapter and progressive curriculum, but presents no equations, fitted-parameter predictions, or uniqueness theorems that reduce the claimed benchmark gains to quantities defined inside the same work. Results on MMEB-v2 are reported as direct empirical outcomes of the added components rather than derived predictions. No self-citations are used to justify load-bearing premises, and the approach is framed as building on existing multimodal models with independent training stabilizations. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that latent states can substitute for verbalized reasoning and on the engineering choice of a new transition adapter whose parameters are learned during the curriculum.

free parameters (1)
  • number of latent steps
    Fixed to fewer than 10 to achieve the reported speed-up; value chosen for efficiency rather than derived.
axioms (1)
  • domain assumption: Autoregressive evolution of continuous hidden states can perform the same intent-inference function as explicit chain-of-thought text generation.
    Invoked when the paper states that latent rollout replaces verbalized CoT while preserving benefits.
invented entities (1)
  • semantic-anchor-guided transition adapter (no independent evidence)
    purpose: Steers the latent rollout along different reasoning trajectories under fixed compute budget.
    New module introduced to support diverse multimodal queries.

pith-pipeline@v0.9.0 · 5592 in / 1305 out tokens · 38259 ms · 2026-05-13T21:45:14.456331+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Breath1024.lean (period8, neutral8, flipAt512) · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The latent rollout length is set to K=8... reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps... semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget.

  • IndisputableMonolith/Foundation/ArrowOfTime.lean (z_monotone_absolute, arrow_from_z) · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    PLUME internalizes reasoning for UME into a compact latent process inside the MLLM... short autoregressive rollout of continuous latent states... progressive explicit-to-latent curriculum

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 7

  2. [2]

    Curriculum learning

    Yoshua Bengio, J ´erˆome Louradour, Ronan Collobert, and Ja- son Weston. Curriculum learning. InProceedings of the 26th International Conference on Machine Learning, pages 41–48. ACM, 2009. 6

  3. [3]

    Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings.arXiv preprint arXiv:2506.23115, 2025

    Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, and Zhicheng Dou. Moca: Modality-aware continual pre-training makes better bidirectional multimodal embeddings.arXiv preprint arXiv:2506.23115, 2025. 3

  4. [4]

    Reasoning beyond language: A compre- hensive survey on latent chain-of-thought reasoning.CoRR, abs/2505.16782, 2025

    Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Han- lin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, and Xiaoyu Shen. Reasoning beyond language: A compre- hensive survey on latent chain-of-thought reasoning.CoRR, abs/2505.16782, 2025. 2, 3

  5. [5]

    Think then embed: Generative context improves multimodal embedding.arXiv preprint arXiv:2510.05014, 2025

    Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Yonghuan Yang, Jun Xiao, et al. Think then embed: Generative con- text improves multimodal embedding.arXiv preprint arXiv:2510.05014, 2025. 2, 3, 7

  6. [6]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, C ´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language mod- els.arXiv preprint arXiv:2407.01449, 2024. 3

  7. [7]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with sim- ple and efficient sparsity.ArXiv, abs/2101.03961, 2021. 5

  8. [8]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. InIn- ternational Conference on Learning Representations (ICLR),

  9. [9]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. 1

  10. [10]

    Breaking the modality barrier: Universal embedding learning with multimodal llms.arXiv preprint arXiv:2504.17432, 2025

    Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Wei- dong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal llms.arXiv preprint arXiv:2504.17432, 2025. 3

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 3

  12. [12]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large lan- guage models to reason in a continuous latent space.CoRR, abs/2412.06769, 2024. 2, 3

  13. [13]

    Trace: Task-adaptive rea- soning and representation learning for universal multimodal retrieval, 2026

    Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, and Jinqiao Wang. Trace: Task-adaptive rea- soning and representation learning for universal multimodal retrieval, 2026. 2, 3

  14. [14]

    Gaussian error linear units (gelus).arXiv: Learning, 2016

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv: Learning, 2016. 5

  15. [15]

    Unsupervised dense information retrieval with con- trastive learning.Trans

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebas- tian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with con- trastive learning.Trans. Mach. Learn. Res., 2022, 2022. 6

  16. [16]

    Cumulated gain- based evaluation of ir techniques.ACM Transactions on In- formation Systems (TOIS), 20(4):422–446, 2002

    Kalervo J ¨arvelin and Jaana Kek ¨al¨ainen. Cumulated gain- based evaluation of ir techniques.ACM Transactions on In- formation Systems (TOIS), 20(4):422–446, 2002. 7

  17. [17]

    Scaling up visual and vision-language representa- tion learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR,

  18. [18]

    Embed- rl: Reinforcement learning for reasoning-driven multimodal embeddings.arXiv preprint arXiv:2602.13823, 2026

    Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, and Yansong Tang. Embed- rl: Reinforcement learning for reasoning-driven multimodal embeddings.arXiv preprint arXiv:2602.13823, 2026. 3

  19. [19]

    E5-V: universal embeddings with multi- modal large language models

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models.arXiv preprint arXiv:2407.12580, 2024. 2, 3

  20. [20]

    Jiang, R

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2410.05160, 2024. 1, 2, 3, 4, 7

  21. [21]

    Laser: Internal- izing explicit reasoning into latent space for dense retrieval,

    Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, and Zhicheng Dou. Laser: Internal- izing explicit reasoning into latent space for dense retrieval,

  22. [22]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- 10 ceedings of the 29th Symposium on Operating Systems Prin- ciples, SOSP 2023, Koblenz, Germany, October 23-26, 2023, pages 611–626. ACM, 2023. 5

  23. [23]

    Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning.arXiv preprint arXiv:2503.04812, 2025. 3

  24. [24]

    Ume-r1: Exploring reasoning-driven generative multimodal embeddings.arXiv preprint arXiv:2511.00405, 2025

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jin- song Su. Ume-r1: Exploring reasoning-driven generative multimodal embeddings.arXiv preprint arXiv:2511.00405,

  25. [25]

    Nv- embed: Improved techniques for training llms as generalist embedding models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv- embed: Improved techniques for training llms as generalist embedding models. InThe Thirteenth International Confer- ence on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 3

  26. [26]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1

  27. [27]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

  28. [28]

    arXiv preprint arXiv:2411.02571 , year=

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024. 2, 3

  29. [29]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

  30. [30]

    Lamra: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4015–4025, 2025. 3, 7

  31. [31]

    arXiv preprint arXiv:2507.04590 , year=

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal em- bedding for videos, images, and visual documents.arXiv preprint arXiv:2507.04590, 2025. 1, 3, 7, 8

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 3, 4, 6

  33. [33]

    Jacob Pfau, William Merrill, and Samuel R. Bowman. Let’s think dot by dot: Hidden computation in transformer lan- guage models.ArXiv, abs/2404.15758, 2024. 3

  34. [34]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 3

  35. [35]

    Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer, 2017

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outra- geously large neural networks: The sparsely-gated mixture- of-experts layer, 2017. 5

  36. [36]

    Codi: Compressing chain-of-thought into continuous space via self-distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. CODI: Compressing chain-of-thought into continuous space via self-distillation. CoRR, abs/2502.21074, 2025.

  37. [37]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.

  38. [38]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. CoRR, abs/2212.03533, 2022.

  39. [39]

    Improving text embeddings with large language models

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 11897–11916. Association for Computational ...

  40. [40]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

  41. [41]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.

  42. [42]

    Uniir: Training and benchmarking universal multimodal information retrievers

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. In European Conference on Computer Vision, pages 387–404. Springer, 2024.

  43. [43]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

  44. [44]

    C-pack: Packed resources for general chinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, pages 641–649. ACM, 2024.

  45. [45]

    Llava-cot: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Ziang Wu, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2087–2098, 2025.

  46. [46]

    Recall: Recalibrating capability degradation for mllm-based composed image retrieval

    Tianyu Yang, Chenwei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, and Tat-Seng Chua. Recall: Recalibrating capability degradation for mllm-based composed image retrieval. arXiv preprint arXiv:2602.01639, 2026.

  47. [47]

    Recall: Recalibrating capability degradation for mllm-based composed image retrieval

    Tianyu Yang, Chenwei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, and Tat-Seng Chua. Recall: Recalibrating capability degradation for mllm-based composed image retrieval, 2026.

  48. [48]

    Cafe: Unifying representation and generation with contrastive-autoregressive finetuning

    Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, and Hanchao Yu. Cafe: Unifying representation and generation with contrastive-autoregressive finetuning. arXiv preprint arXiv:2503.19900, 2025.

  49. [49]

    Visrag: Vision-based retrieval-augmented generation on multi-modality documents

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024.

  50. [50]

    Quiet-star: Language models can teach themselves to think before speaking

    E. Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-star: Language models can teach themselves to think before speaking. ArXiv, abs/2403.09629, 2024.

  51. [51]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.

  52. [52]

    Magiclens: Self-supervised image retrieval with open-ended instructions

    Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651, 2024.

  53. [53]

    Chain of preference optimization: Improving chain-of-thought reasoning in llms

    Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, and Min Lin. Chain of preference optimization: Improving chain-of-thought reasoning in llms. Advances in Neural Information Processing Systems, 37:333–356, 2024.

  54. [54]

    GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms. arXiv preprint arXiv:2412.16855, 2024.

  55. [55]

    Bridging modalities: Improving universal multimodal retrieval by multimodal large language models

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9274–9285, 2025.

  56. [56]

    Megapairs: Massive data synthesis for universal multimodal retrieval

    Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, and Yongping Xiong. Megapairs: Massive data synthesis for universal multimodal retrieval. arXiv preprint arXiv:2412.14475, 2024.

  57. [57]

    Megapairs: Massive data synthesis for universal multimodal retrieval

    Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. Megapairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19076–19095, 2025.

A. Curriculum Ablation De...