pith. machine review for the scientific record.

arxiv: 2604.11095 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Bottleneck Tokens for Unified Multimodal Retrieval

Dongxiao Mao, Haohua Zhao, Jiang Shaohua, Jing Ren, Liqing Zhang, Siyu Sun, Weixiong Lin, Xiangyuan Ren, Yiyi Zhang, Yuchao Zheng, Zhaohe Liao

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords multimodal retrieval, bottleneck tokens, decoder-only MLLMs, semantic compression, generative information condensation, unified multimodal, contrastive fine-tuning, next-token prediction

The pith

Learnable bottleneck tokens with masked generative training close structural gaps in decoder-only multimodal retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses two problems in adapting multimodal large language models for retrieval: implicit pooling, which overloads the hidden state of a standard vocabulary token, and the absence of token-level compression guidance in contrastive training. It proposes Bottleneck Tokens as learnable explicit pooling elements and Generative Information Condensation, in which a Condensation Mask directs all predictive signals through these tokens during next-token prediction. This converts the generative loss into supervision for semantic compression. If correct, the method delivers improved retrieval performance on a wide range of multimodal tasks with minimal added computation at inference time.

Core claim

Bottleneck Tokens are a small set of learnable tokens that provide explicit, fixed-capacity pooling. When the model is trained under Generative Information Condensation, with a Condensation Mask severing direct attention from target tokens to query tokens, the next-token prediction objective supplies dense supervision that forces faithful semantic compression into the bottleneck representations, which then serve as sequence embeddings.
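As a concrete illustration, the masking pattern described in this claim might look like the following sketch. The [query | bottleneck | target] sequence layout, the boolean convention (True = may attend), and the function name are our assumptions, not the paper's stated implementation:

```python
import numpy as np

def condensation_mask(n_query: int, n_btok: int, n_target: int) -> np.ndarray:
    """Boolean attention mask (True = may attend) for a sequence laid out as
    [query | bottleneck tokens | target].

    Sketch of the Condensation Mask idea: standard causal attention, except
    that target tokens may NOT attend directly to query tokens, so any
    predictive signal reaching the targets must flow through the BToks."""
    n = n_query + n_btok + n_target
    mask = np.tril(np.ones((n, n), dtype=bool))  # ordinary causal mask
    t0 = n_query + n_btok                        # index of first target token
    mask[t0:, :n_query] = False                  # sever target -> query paths
    return mask
```

Under this convention, bottleneck tokens still see the query causally, and targets still see the bottleneck tokens and earlier targets, so the only severed edges are the direct target-to-query ones.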

What carries the argument

Bottleneck Tokens (BToks), a small set of learnable tokens serving as fixed-capacity explicit pooling, together with the Condensation Mask that routes all information through them during training.

Load-bearing premise

That routing all signals through the Bottleneck Tokens with the Condensation Mask creates a faithful semantic compression generalizing beyond the training distribution without new failure modes.

What would settle it

Evaluation on a new multimodal retrieval benchmark with out-of-distribution semantics: if the bottleneck model underperforms standard pooling there, or demonstrably loses fine-grained details, the claim of faithful compression is falsified.

Figures

Figures reproduced from arXiv: 2604.11095 by Dongxiao Mao, Haohua Zhao, Jiang Shaohua, Jing Ren, Liqing Zhang, Siyu Sun, Weixiong Lin, Xiangyuan Ren, Yiyi Zhang, Yuchao Zheng, Zhaohe Liao.

Figure 1. Bottleneck tokens for unified multimodal retrieval.
original abstract

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Bottleneck Tokens (BToks), a small set of learnable tokens serving as explicit fixed-capacity pooling, together with Generative Information Condensation: a next-token prediction objective augmented by a Condensation Mask that severs direct attention paths from target tokens to query tokens, thereby routing all supervision through the BToks. The authors claim this yields state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark (78 datasets, 3 modalities), with an overall score of 59.0 (+3.6 over VLM2Vec-V2) and larger gains on semantically demanding tasks such as Video-QA.

Significance. If the reported gains are attributable to the proposed components rather than uncontrolled experimental factors, the work supplies a concrete architectural and objective-level solution to the mismatch between standard last-token pooling and the requirements of dense retrieval in decoder-only MLLMs. The negligible inference overhead and the conversion of generative loss into token-level compression supervision are attractive features that could be adopted more broadly if the compression is shown to be faithful and generalizable.

major comments (2)
  1. [Method] The central mechanism (Condensation Mask severing target-to-query attention so that all predictive signal passes through the fixed-capacity BToks) is presented as converting next-token prediction into dense supervision for semantic compression. No attention-map analysis, information-flow measurement, or ablation that isolates the mask's effect on leakage versus compression is provided; without such verification the claim that the resulting BTok states encode retrieval-suitable semantics rather than task-specific generative artifacts remains untested.
  2. [Experiments] The SOTA claim on MMEB-V2 (Overall 59.0, +3.6 over VLM2Vec-V2, +12.6 on Video-QA) is load-bearing for the paper's contribution yet is reported without ablation tables for BToks versus standard pooling, without statistical significance tests, and without explicit documentation that baseline training data volume, compute, and initialization were matched. These omissions make it impossible to attribute the gains to the proposed condensation procedure rather than differences in training distribution.
minor comments (1)
  1. [Method] Notation for the Condensation Mask and the precise attention masking pattern should be formalized with an equation or pseudocode diagram to allow exact reproduction.
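One way the formalization requested in the minor comment might read (the notation is ours, not the paper's): let $\mathcal{Q}$, $\mathcal{B}$, $\mathcal{T}$ be the index sets of query, bottleneck, and target tokens, and define an additive attention mask that keeps the causal pattern everywhere except on the severed target-to-query entries.

```latex
% Hedged formalization sketch; \mathcal{Q}, \mathcal{B}, \mathcal{T}, M are our notation.
M_{ij} =
\begin{cases}
0       & \text{if } j \le i \ \text{and}\ \neg\,(i \in \mathcal{T} \wedge j \in \mathcal{Q}),\\
-\infty & \text{otherwise},
\end{cases}
\qquad
\operatorname{Attn}(X) =
\operatorname{softmax}\!\left(\frac{X W_Q (X W_K)^{\top}}{\sqrt{d}} + M\right) X W_V .
```

Setting $M_{ij} = -\infty$ zeroes the corresponding attention weight after the softmax, which is exactly the "severed path" behavior the method description asserts.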

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of verification and experimental rigor that we will address in the revision. Below we respond point by point to the major comments.

point-by-point responses
  1. Referee: [Method] The central mechanism (Condensation Mask severing target-to-query attention so that all predictive signal passes through the fixed-capacity BToks) is presented as converting next-token prediction into dense supervision for semantic compression. No attention-map analysis, information-flow measurement, or ablation that isolates the mask's effect on leakage versus compression is provided; without such verification the claim that the resulting BTok states encode retrieval-suitable semantics rather than task-specific generative artifacts remains untested.

    Authors: We agree that direct empirical verification of the information routing would strengthen the methodological contribution. In the revised manuscript we will add (i) attention-map visualizations for representative examples with and without the Condensation Mask, (ii) an information-flow measurement (e.g., average attention mass from target tokens to BToks versus query tokens), and (iii) an ablation that removes the mask while keeping all other components fixed. These additions will quantify the mask's role in preventing leakage and will demonstrate that the resulting BTok representations are retrieval-oriented rather than purely generative. revision: yes

  2. Referee: [Experiments] The SOTA claim on MMEB-V2 (Overall 59.0, +3.6 over VLM2Vec-V2, +12.6 on Video-QA) is load-bearing for the paper's contribution yet is reported without ablation tables for BToks versus standard pooling, without statistical significance tests, and without explicit documentation that baseline training data volume, compute, and initialization were matched. These omissions make it impossible to attribute the gains to the proposed condensation procedure rather than differences in training distribution.

    Authors: We acknowledge that stronger documentation is required to support attribution of the observed gains. The original text states that results are obtained 'under comparable data conditions'; nevertheless, we will expand the experimental section in revision by (i) adding a dedicated ablation table that isolates Bottleneck Tokens against standard last-token pooling under identical training data, compute budget, and initialization, (ii) reporting paired statistical significance tests across the 78 datasets, and (iii) including an appendix table that explicitly lists training data volume, total compute (FLOPs), and initialization details for our model and each baseline. These changes will make the experimental controls transparent and allow readers to assess the contribution of the condensation procedure. revision: yes
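The information-flow measurement proposed in the first response (average attention mass from target tokens to BToks versus query tokens) could be computed along these lines; the function interface and index conventions are ours, not the authors':

```python
import numpy as np

def target_attention_mass(attn, query_idx, btok_idx, target_idx):
    """attn: (seq, seq) row-stochastic attention weights for one head.

    Returns the average attention mass that target-token rows place on
    bottleneck tokens vs. query tokens. With the Condensation Mask in
    place, the query share should be exactly zero; without it, a nonzero
    share quantifies direct target-to-query leakage."""
    rows = attn[target_idx]                        # target tokens as queries
    to_btok = rows[:, btok_idx].sum(axis=1).mean()
    to_query = rows[:, query_idx].sum(axis=1).mean()
    return float(to_btok), float(to_query)
```

Averaging this statistic over heads, layers, and examples would give the kind of leakage-versus-compression evidence the referee asks for.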

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces novel architectural components (Bottleneck Tokens as explicit pooling) and a training mechanism (Generative Information Condensation with Condensation Mask to route signals through BToks) to adapt decoder-only MLLMs for retrieval. These are presented as solutions to identified structural gaps, with results consisting of empirical SOTA scores on the MMEB-V2 benchmark (78 datasets). No equations, first-principles derivations, or predictions are described that reduce by construction to fitted parameters or self-referential inputs. The approach is self-contained via new design choices and external benchmark evaluation, with no load-bearing self-citations or renamings of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on two newly introduced components whose effectiveness is asserted via benchmark numbers rather than prior independent evidence.

free parameters (1)
  • Bottleneck Tokens
    A small set of learnable tokens whose parameters are optimized during training to serve as the pooling representation.
axioms (1)
  • domain assumption The model architecture can route and compress semantic information through a fixed number of additional tokens when direct attention paths are masked.
    Invoked when describing how the Condensation Mask converts the generative loss into supervision for the BToks.
invented entities (2)
  • Bottleneck Tokens no independent evidence
    purpose: Explicit fixed-capacity pooling mechanism to replace implicit last-token pooling.
    New learnable tokens added to the input sequence.
  • Condensation Mask no independent evidence
    purpose: Attention mask that severs direct paths from target tokens to query tokens to force information through the BToks.
    New masking strategy introduced for training.

pith-pipeline@v0.9.0 · 5597 in / 1435 out tokens · 68998 ms · 2026-05-10T15:08:26.492036+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 26 canonical work pages · 8 internal anchors

  1. [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo: a visual language model for few-shot learning…

  2. [2] Cui, X., Cheng, J., Chen, H.y., Shukla, S.N., Awasthi, A., Pan, X., Ahuja, C., Mishra, S.K., Guo, Q., Lim, S.N., Singh, A., Fan, X.: Think then embed: Generative context improves multimodal embedding. arXiv preprint arXiv:2510.05014 (2025), https://arxiv.org/abs/2510.05014

  3. [3] Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., Colombo, P.: ColPali: Efficient document retrieval with vision language models. In: ICLR (2025), https://arxiv.org/abs/2407.01449

  4. [4] Gu, T., Yang, K., Zhang, K., An, X., Feng, Z., Zhang, Y., Cai, W., Deng, J., Bing, L.: UniME-V2: MLLM-as-a-judge for universal multimodal embedding learning. In: AAAI (2026), https://arxiv.org/abs/2510.13515

  5. [5] Jaegle, A., Gimeno, F., Brock, A., Zisserman, A., Vinyals, O., Carreira, J.: Perceiver: General perception with iterative attention (2021), https://arxiv.org/abs/2103.03206

  6. [6] Jian, W., Zhang, Y., Liang, D., Xie, C., He, Y., Leng, D., Yin, Y.: RzenEmbed: Towards comprehensive multimodal retrieval. arXiv preprint arXiv:2510.27350 (2025), https://arxiv.org/abs/2510.27350

  7. [7] Jiang, H., Wang, Y., Zhu, Y., Lu, X., Qin, W., Wang, M., Wan, P., Tang, Y.: Embed-RL: Reinforcement learning for reasoning-driven multimodal embeddings. arXiv preprint arXiv:2602.13823 (2026), https://arxiv.org/abs/2602.13823

  8. [8] Jiang, T., Song, M., Zhang, Z., Huang, H., Deng, W., Sun, F., Zhang, Q., Wang, D., Zhuang, F.: E5-V: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580 (2024), https://arxiv.org/abs/2407.12580

  9. [9] Jiang, Z., Meng, R., Yang, X., Yavuz, S., Zhou, Y., Chen, W.: VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2410.05160 (2024), https://arxiv.org/abs/2410.05160

  10. [10] Lan, Z., Niu, L., Meng, F., Zhou, J., Su, J.: UME-R1: Exploring reasoning-driven generative multimodal embeddings (2026), https://arxiv.org/abs/2511.00405

  11. [11] Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., Ping, W.: NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428 (2024), https://arxiv.org/abs/2405.17428

  12. [12] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer. Transactions on Machine Learning Research (2024), https://arxiv.org/abs/2408.03326

  13. [13] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023), https://arxiv.org/abs/2301.12597

  14. [14] Lin, S.C., Lee, C., Shoeybi, M., Lin, J., Catanzaro, B., Ping, W.: MM-Embed: Universal multimodal retrieval with multimodal LLMs. In: ICLR (2025), https://arxiv.org/abs/2411.02571

  15. [15] Liu, Y., Chen, P., Cai, J., Jiang, X., Hu, Y., Yao, J., Wang, Y., Xie, W.: LamRA: Large multimodal model as your advanced retrieval assistant. In: CVPR (2025), https://arxiv.org/abs/2412.01720

  16. [16] Meng, R., Jiang, Z., Liu, Y., Su, M., Yang, X., Fu, Y., Qin, C., Chen, Z., Xu, R., Xiong, C., Zhou, Y., Chen, W., Yavuz, S.: VLM2Vec-V2: Advancing multimodal embedding for videos, images, and visual documents. arXiv preprint arXiv:2507.04590 (2025), https://arxiv.org/abs/2507.04590

  17. [17] Mu, N., Kirillov, A., Wagner, D., Xie, S.: SLIP: Self-supervision meets language-image pre-training. In: ECCV (2022), https://arxiv.org/abs/2112.12750

  18. [18] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018), https://arxiv.org/abs/1807.03748

  19. [19] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021), https://arxiv.org/abs/2103.00020

  20. [20] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786 (2025), https://arxiv.org/abs/2502.14786

  21. [21] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024), https://arxiv.org/abs/2409.12191

  22. [22] Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: ICML, Proceedings of Machine Learning Research, vol. 119, pp. 9922–9932 (2020), https://arxiv.org/abs/2005.10242

  23. [23] Yu, H., Zhao, Z., Yan, S., Korycki, L., Wang, J., He, B., Liu, J., Zhang, L., Fan, X., Yu, H.: CAFe: Unifying representation and generation with contrastive-autoregressive finetuning. arXiv preprint arXiv:2503.19900 (2025), https://arxiv.org/abs/2503.19900

  24. [24] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022), https://arxiv.org/abs/2205.01917

  25. [25] Yu, S., Tang, C., Xu, B., Cui, J., Ran, J., Yan, Y., Liu, Z., Shi, S., Qin, B., Liu, T.: VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594 (2024), https://arxiv.org/abs/2410.10594

  26. [26] Zhang, R., Zhou, Y., Wang, Y., Feng, Z., Luo, C., Wang, L., Yang, J., Wang, L.: LLaVA-Hound-DPO: Direct preference optimization for video large multimodal models. arXiv preprint arXiv:2312.15305 (2024), https://arxiv.org/abs/2312.15305

  27. [27] Zhang, X., Zhang, Y., Xie, W., Li, M., Dai, Z., Long, D., Xie, P., Zhang, M., Li, W., Zhang, M.: GME: Improving universal multimodal retrieval by multimodal LLMs. In: CVPR (2025), https://arxiv.org/abs/2412.16855