pith. machine review for the scientific record.

arxiv: 2604.13710 · v3 · submitted 2026-04-15 · 💻 cs.CV

Recognition: 2 Lean theorem links

SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal retrieval · frozen MLLMs · parameter-efficient adaptation · shared latent queries · knowledge-aware retrieval · image-text matching

The pith

A small set of shared latent queries appended to frozen MLLMs aggregates text and image tokens into unified retrieval embeddings without any backbone updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multimodal large language models can be turned into strong retrievers by adding a handful of trainable shared latent queries to their input tokens while leaving every original parameter untouched. These queries ride the model's existing causal attention to pull multimodal context into one embedding space. The result beats full fine-tuning and LoRA on standard image-text retrieval sets (COCO, Flickr30K), stays competitive on MMEB, and delivers larger gains on KARR-Bench, a new benchmark that tests knowledge-aware reasoning. The central idea is that the pre-trained semantic structure already contains the right knowledge; invasive updates only risk erasing it.

Core claim

SLQ adapts MLLMs for retrieval by appending a small set of shared latent queries to both text and image tokens, then relying on the model's native causal attention to aggregate multimodal context into a single embedding, all while the backbone remains completely frozen.

What carries the argument

Shared Latent Queries: a trainable but tiny set of vectors appended to the input sequence so that causal attention pools text and vision features into one retrieval vector.
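Mechanically, the idea is simple enough to sketch. The toy below (numpy, a single attention head with identity Q/K/V projections standing in for a frozen layer; the query count and dimensions are illustrative assumptions, not the paper's settings) appends a shared query set to each modality's tokens and pools the query positions into a unit-norm retrieval vector:

```python
import numpy as np

def causal_attention(x):
    """One causal self-attention pass with identity Q/K/V (toy stand-in for a frozen layer)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ x

def slq_embed(tokens, latent_queries):
    """Append the shared latent queries after the modality tokens; pool their outputs."""
    out = causal_attention(np.concatenate([tokens, latent_queries], axis=0))
    emb = out[-len(latent_queries):].mean(axis=0)   # only the query positions are read
    return emb / np.linalg.norm(emb)                # unit norm for cosine retrieval

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))                          # the only trainable parameters
text_emb = slq_embed(rng.normal(size=(6, 8)), queries)     # frozen text tokens
image_emb = slq_embed(rng.normal(size=(10, 8)), queries)   # frozen image tokens
score = float(text_emb @ image_emb)                        # cosine similarity at retrieval time
```

Because the queries sit at the end of the causal sequence, they can attend to every preceding token while leaving the frozen computation over those tokens unchanged.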

If this is right

  • Retrieval performance improves over methods that update model weights, because pre-trained reasoning and world knowledge stay intact.
  • The same frozen backbone can be reused for multiple downstream tasks without repeated destructive fine-tuning.
  • Knowledge-aware retrieval benchmarks become more informative once adaptation no longer trades away structured knowledge.
  • Training cost and memory drop sharply since only the tiny query set is optimized.
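The cost point in the last bullet is easy to make concrete. All sizes below are hypothetical stand-ins (the paper's query count and backbone dimensions are not reproduced here), but the shape of the gap versus LoRA is representative:

```python
# Hypothetical sizes, for illustration only.
d_model, n_layers = 2048, 24      # assumed backbone width and depth
k = 16                            # assumed number of shared latent queries

slq_params = k * d_model          # SLQ trains only the query vectors: 32,768

r = 8                             # assumed LoRA rank
# LoRA on the q, k, v, o projections: two rank-r factors per matrix, per layer
lora_params = n_layers * 4 * (d_model * r + r * d_model)   # 3,145,728

ratio = lora_params / slq_params  # under these assumptions, ~96x fewer trainable parameters
```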

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query-append trick may let frozen MLLMs handle other output-head tasks such as captioning or visual question answering with minimal new parameters.
  • If the queries truly act as a lightweight aggregator, they could be swapped or extended at inference time to support different retrieval granularities without retraining.
  • The approach implies that many current fine-tuning practices for MLLMs may be over-parameterized for tasks that mainly need better input pooling.

Load-bearing premise

The model's existing attention layers can integrate the added queries well enough to produce a unified multimodal embedding without any parameter changes to the original network.

What would settle it

An experiment in which SLQ is run with the shared queries removed or replaced by random vectors and retrieval accuracy falls to the level of the unmodified frozen model on both COCO and KARR-Bench.
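The harness for that experiment is small. A hedged sketch, with synthetic unit-norm embeddings standing in for SLQ outputs and for the random-query control (the recall numbers this produces are illustrative, not the paper's):

```python
import numpy as np

def recall_at_1(text_emb, image_emb):
    """Fraction of texts whose top-1 retrieved image is the paired one."""
    sims = text_emb @ image_emb.T            # rows are unit-norm, so this is cosine
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(sims))))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d = 50, 16
shared = normalize(rng.normal(size=(n, d)))                      # latent content of each pair
slq_text   = normalize(shared + 0.05 * rng.normal(size=(n, d)))  # stand-in: learned queries
slq_image  = normalize(shared + 0.05 * rng.normal(size=(n, d)))
rand_text  = normalize(rng.normal(size=(n, d)))                  # stand-in: random queries
rand_image = normalize(rng.normal(size=(n, d)))

r1_slq = recall_at_1(slq_text, slq_image)        # high: pairs share structure
r1_random = recall_at_1(rand_text, rand_image)   # near chance (1/n)
```

If the paper's ablation behaves like the random-query control here, the shared queries are doing the aggregation work; if not, the gains live somewhere else.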

Figures

Figures reproduced from arXiv: 2604.13710 by Chunxiao Fan, Haoran Lou, Hao Wu, Kai Zuo, Xu Tang, Yibo Chen, Yue Ming, Yuexin Wu, Ziyan Liu.

Figure 1
Figure 1: Diagnostic pilot study. We compare the zero-shot retrieval performance of the last-token baseline against the query-token method on the InternVL3-1B backbone. The retrieval score based on cosine similarity is reported as the metric. The results suggest that the query-token method better aggregates global context, enabling implicit reasoning for retrieval. To validate this hypothesis, we designed a diagnostic p…
Figure 2
Figure 2: Overview of KARR-Bench. (a) Comparison between standard explicit captions and our knowledge-reasoning captions. (b) The distribution of categories in KARR-Bench. (2) Knowledge-enhanced query generation uses GPT-5-mini to encode target identities into implicit reasoning queries without explicit names or synonyms (Figure 2a), producing 4,500 candidate samples. (3) Human verification involves fo…
Figure 3
Figure 3: Overview of the SLQ framework. (a) SLQ bridges the modality gap using a set of Shared Latent Queries that interact with the frozen MLLM via causal attention. The queries are appended to both image and text tokens, projecting them into a unified embedding space. Only the queries are optimized via contrastive learning while the MLLM backbone remains frozen. (b) Parameter-efficiency comparison showing average…
Figure 4
Figure 4: Performance comparison on KARR-Bench. Left to right: results on InternVL3-1B, Qwen3VL-2B, Qwen3VL-4B, and InternVL3-8B. Our method (shown in orange) consistently outperforms both invasive tuning baselines. Specifically, on the strongest InternVL3-8B backbone, our method achieves significant gains over LoRA and full FT. These results indicate that preserving a frozen backbone while SLQ is effective for know…
Figure 5
Figure 5: Visualization of the unified representation space using PCA. Red points represent image embeddings, and blue points represent text embeddings. Full FT and LoRA exhibit a noticeably broader spatial spread. In contrast, SLQ maintains a much more compact distribution, resulting in a smaller centroid-distance gap and demonstrating superior cross-modal alignment. The gap between SLQ and LoRA widens from 3.3% (1…
Figure 6
Figure 6: (no caption recoverable at source)
Figure 7
Figure 7: Linguistic and semantic comparison between COCO and KARR-Bench. (a) The cosine-similarity distribution demonstrates that KARR-Bench captions maintain semantic relevance to the visual content despite abstraction. (b) KARR-Bench exhibits significantly higher syntactic complexity (MDD). (c) The cognitive load (Flesch-Kincaid Grade Level) doubles from COCO to KARR-Bench, indicating a shift to advanced reading …
Figure 8
Figure 8: Word-cloud comparison. The vocabulary shifts from observational primitives in COCO (a) to abstract, functional, and relational terms in KARR-Bench (b), reflecting the requirement for implicit reasoning. In Figure 7(b), KARR-Bench captions exhibit a notably higher Mean Dependency Distance (MDD) [25] compared to COCO. This metric suggests that our queries move beyond simple subject-verb-object structures, employin…
Figure 9
Figure 9: Qualitative comparison of text-to-image retrieval on KARR-Bench.
Figure 10
Figure 10: Qualitative comparison of image-to-text retrieval on KARR-Bench.
Original abstract

Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available under: https://github.com/CnFaker/SLQ.
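The abstract freezes everything except the query set; per the paper's Figure 3, those queries are the only parameters optimized, via contrastive learning. A minimal symmetric InfoNCE loss, sketched in numpy with an assumed batch size and temperature (not the paper's implementation):

```python
import numpy as np

def info_nce(text_emb, image_emb, tau=0.07):
    """Symmetric contrastive loss; matched pairs sit on the diagonal of the logit matrix."""
    logits = text_emb @ image_emb.T / tau
    n = len(logits)
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)                    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

aligned_loss = info_nce(emb, emb)                        # matched pairs: low loss
shuffled_loss = info_nce(emb, np.roll(emb, 1, axis=0))   # mismatched pairs: high loss
```

In SLQ this gradient would flow only into the latent query vectors; the frozen backbone just carries the forward pass.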

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SLQ, a parameter-efficient adaptation method for multimodal large language models (MLLMs) in dense retrieval tasks. It appends a small set of trainable Shared Latent Queries to the frozen backbone's image and text tokens, relying on the model's native causal attention to aggregate multimodal context into unified embeddings without any backbone parameter updates. Experiments claim that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, remains competitive on MMEB, and delivers substantial gains on the newly introduced KARR-Bench for knowledge-aware reasoning retrieval.

Significance. If the empirical results hold under rigorous controls, the work would be significant for demonstrating that non-invasive adaptation via latent queries can preserve pre-trained MLLM representations and reasoning capabilities better than invasive tuning, while introducing KARR-Bench as a useful new evaluation resource for knowledge-intensive multimodal retrieval.

major comments (3)
  1. [§3] §3 (Method): The central claim that appending a small set of shared latent queries to frozen MLLM tokens suffices for cross-modal aggregation via pre-trained causal attention lacks direct verification. No attention map visualizations, ablation on query-token interactions, or analysis of QKV projections applied to the new tokens are provided to rule out the possibility that the queries receive negligible or non-selective attention, which would undermine the non-invasive adaptation argument.
  2. [§4.2, Table 2] §4.2 and Table 2 (Experiments on COCO/Flickr30K): The reported outperformance over full fine-tuning and LoRA is presented without error bars, multiple random seeds, or statistical significance tests. Given that the soundness assessment notes unverified experimental controls, these gains cannot yet be treated as robust evidence for the superiority of the frozen-backbone approach.
  3. [§4.3] §4.3 (KARR-Bench): The substantial gains on the new benchmark are load-bearing for the knowledge-aware claim, yet the paper provides insufficient detail on how KARR-Bench differs from standard retrieval sets in terms of query construction, negative sampling, and evaluation metrics, nor does it include ablations isolating the contribution of the shared queries versus other training choices.
minor comments (2)
  1. [§3.2] The hyperparameter choice for the number of shared latent queries is mentioned but not accompanied by a sensitivity analysis or ablation table showing performance as a function of query count.
  2. [§4] Figure captions and axis labels in the experimental section could be expanded to explicitly state what each baseline configuration entails (e.g., whether LoRA rank and target modules match the SLQ setup).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and will revise the manuscript accordingly to strengthen the empirical support and clarity.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that appending a small set of shared latent queries to frozen MLLM tokens suffices for cross-modal aggregation via pre-trained causal attention lacks direct verification. No attention map visualizations, ablation on query-token interactions, or analysis of QKV projections applied to the new tokens are provided to rule out the possibility that the queries receive negligible or non-selective attention, which would undermine the non-invasive adaptation argument.

    Authors: We agree that direct verification would strengthen the claim. In the revised version, we will add attention map visualizations illustrating how the shared latent queries aggregate information from image and text tokens. We will also include ablations on query count and interactions, plus analysis of QKV projections on the new tokens to demonstrate selective attention. revision: yes

  2. Referee: [§4.2, Table 2] §4.2 and Table 2 (Experiments on COCO/Flickr30K): The reported outperformance over full fine-tuning and LoRA is presented without error bars, multiple random seeds, or statistical significance tests. Given that the soundness assessment notes unverified experimental controls, these gains cannot yet be treated as robust evidence for the superiority of the frozen-backbone approach.

    Authors: We acknowledge the need for statistical rigor. We will rerun the COCO and Flickr30K experiments with multiple random seeds, report means and standard deviations as error bars in the updated Table 2, and add statistical significance tests (e.g., paired t-tests) to confirm the improvements. revision: yes

  3. Referee: [§4.3] §4.3 (KARR-Bench): The substantial gains on the new benchmark are load-bearing for the knowledge-aware claim, yet the paper provides insufficient detail on how KARR-Bench differs from standard retrieval sets in terms of query construction, negative sampling, and evaluation metrics, nor does it include ablations isolating the contribution of the shared queries versus other training choices.

    Authors: We will expand the KARR-Bench section with detailed explanations of query construction, negative sampling, and evaluation metrics, highlighting differences from standard sets. We will also add ablations isolating the shared queries' contribution versus other training choices on this benchmark. revision: yes
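The significance test promised in response 2 is a few lines. The per-seed scores below are hypothetical placeholders, not numbers from the paper:

```python
import numpy as np

def paired_t(a, b):
    """Paired t-statistic over per-seed scores of two methods."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))

slq  = [61.2, 60.8, 61.5, 61.0, 61.3]   # hypothetical SLQ R@1 over 5 seeds
lora = [59.9, 60.1, 59.7, 60.3, 60.0]   # hypothetical LoRA R@1 over 5 seeds

t = paired_t(slq, lora)
# Compare |t| to the two-sided critical value for df = 4 (2.776 at alpha = 0.05).
```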

Circularity Check

0 steps flagged

No significant circularity; empirical validation independent of method definition

Full rationale

The paper introduces SLQ as a parameter-efficient adaptation technique by appending trainable shared latent queries to frozen MLLM tokens and leveraging native causal attention for multimodal aggregation. This definition is independent of the reported benchmark results. The central claims rest on direct experimental comparisons (outperformance on COCO/Flickr30K, competitive on MMEB, gains on KARR-Bench) rather than any derivation, equation chain, or self-citation that reduces the outcome to fitted inputs or prior author results by construction. No load-bearing step equates a prediction to its own training data or imports uniqueness via self-reference.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach assumes standard transformer attention can aggregate context from appended queries without backbone updates; no new physical or mathematical axioms are introduced beyond existing MLLM architecture.

free parameters (1)
  • number of shared latent queries
    A small set is introduced but the exact count is not specified in the abstract; this hyperparameter is chosen to enable aggregation.
axioms (1)
  • domain assumption: Causal attention in the frozen MLLM can effectively aggregate multimodal context from appended shared queries.
    Invoked in the description of how queries leverage native attention to produce unified embeddings.
invented entities (1)
  • Shared Latent Queries (no independent evidence)
    purpose: To bridge text and image modalities into a unified embedding space without updating model parameters.
    New component introduced in the method; no independent evidence outside the paper's experiments is provided.

pith-pipeline@v0.9.0 · 5528 in / 1363 out tokens · 16950 ms · 2026-05-12T00:50:27.398184+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

    Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training.arXiv preprint arXiv:2509.23661, 2025

  3. [3]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  4. [4]

Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  5. [5]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  6. [6]

    Flame: Frozen large language models enable data- efficient language-image pre-training

    Anjia Cao, Xing Wei, and Zhiheng Ma. Flame: Frozen large language models enable data- efficient language-image pre-training. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 4080–4090, 2025

  7. [7]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024

  8. [8]

InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

  9. [9]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models.arXiv preprint arXiv:2407.01449, 2024

  10. [10]

    Breaking the modality barrier: Universal embedding learning with multimodal llms

    Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal llms. InProceedings of the 33rd ACM International Conference on Multimedia, pages 2860–2869, 2025

  11. [11]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2021

  12. [12]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational conference on machine learning, pages 4904–4916. PMLR, 2021

  13. [13]

    Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European conference on computer vision, pages 709–727. Springer, 2022

  14. [14]

    E5-V: universal embeddings with multi- modal large language models

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models.arXiv preprint arXiv:2407.12580, 2024

  15. [15]

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

    Ziyan Jiang, Rui Meng, Xinyi Yang, Eshed Yumer, Tong Wang, Yinpeng Yue, and Chi Zhang. Vlm2vec: Training vision-language models for massive multimodal embedding tasks.arXiv preprint arXiv:2404.04120, 2024

  16. [16]

    Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19113–19122, 2023

  17. [17]

    Crafting Papers on Machine Learning

    P. Langley. Crafting papers on machine learning. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 1207–1216, Stanford, CA, 2000. Morgan Kaufmann

  18. [18]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  19. [19]

    Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation. InInternational conference on machine learning, pages 12888–12900. PMLR, 2022

  20. [20]

    Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-vl-embedding and qwen3-vl-reranker: A unified framework for state-of-the-art multimodal retrieval and ranking.arXiv preprint arXiv:2601.04720, 2026

  21. [21]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022

  22. [22]

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms.arXiv preprint arXiv:2411.02571, 2024

  23. [23]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean conference on computer vision, pages 740–755. Springer, 2014

  24. [24]

    Idmr: Towards instance-driven precise visual correspondence in multimodal retrieval

    Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, and Chaochao Lu. Idmr: Towards instance-driven precise visual correspondence in multimodal retrieval. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6320–6329, 2025

  25. [25]

Dependency Distance: A New Perspective on Syntactic Patterns in Natural Languages

    Haitao Liu, Chunshan Xu, and Junying Liang. Dependency distance: A new perspective on syntactic patterns in natural languages.Physics of life reviews, 21:171–193, 2017

  26. [26]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  27. [27]

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

    Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

  28. [28]

    Image retrieval on real-life images with pre-trained vision-and-language models

    Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2125–2134, 2021

  29. [29]

    Llava-sp: Enhancing visual representation with visual spatial tokens for mllms

Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, and Xinliang Wang. LLaVA-SP: Enhancing visual representation with visual spatial tokens for MLLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22014–22024, 2025

  30. [30]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

  31. [31]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models

    Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. InProceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015

  32. [32]

    Tiger: Unifying text-to-image generation and retrieval with large multimodal models

    Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, and Tat-Seng Chua. Tiger: Unifying text-to-image generation and retrieval with large multimodal models. arXiv preprint arXiv:2406.05814, 2024

  33. [33]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InProceedings of the 17th International Conference on Machine Learning (ICML 2021), pages 8748–8763, 2021

  34. [34]

Evaluating Text Complexity and Flesch-Kincaid Grade Level

    Marina Solnyshkina, Radif Zamaletdinov, Ludmila Gorodetskaya, and Azat Gabitov. Evaluating text complexity and flesch-kincaid grade level.Journal of social studies education research, 8(3):238–248, 2017

  35. [35]

    Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks

    Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5227–5237, 2022

  36. [36]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  37. [37]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  38. [38]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. InInternational conference on machine learning, pages 9929–9939. PMLR, 2020

  39. [39]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  40. [40]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021

  41. [41]

    Visrag: Vision-based retrieval-augmented generation on multi-modality documents

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents.arXiv preprint arXiv:2410.10594, 2024

  42. [42]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  43. [43]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  44. [44]

    Magi- clens: Self-supervised image retrieval with open-ended in- structions.arXiv preprint arXiv:2403.19651, 2024

    Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. MagicLens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651, 2024

  45. [45]

    GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv preprint arXiv:2412.16855, 2024

  46. [46]

    MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024

  47. [47]

    Megapairs: Massive data synthesis for universal multimodal retrieval

    Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. MegaPairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19076–19095, 2025

  48. [48]

    Conditional prompt learning for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022

  49. [49]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  50. [50]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

Appendix Overview

This supplementary document is organized as follows: • Sec. A shows the limita...

and CIRR [28] benchmarks. We compare our proposed SLQ method against the baseline E5-V and standard fine-tuning strategies (Full FT and LoRA) across different model scales (InternVL3-1B and 8B). As shown in Table 10, SLQ consistently achieves superior performance compared to other methods. Specifically: On the InternVL3-1B backbone, SLQ outperforms Full F...

• Generic human references (e.g., “man”, “woman”, “person”, “people”, “crowd”)

• Background or environmental elements (e.g., “wall”, “floor”, “sky”, “grass”, “street”, “scene”, “view”)

• Abstract concepts, actions, or events (e.g., “performance”, “activity”, “event”, “situation”)

• Low-semantic or ambiguous objects (e.g., “object”, “thing”, “shape”, “line”)

Constraint: Do not infer or extrapolate object categories beyond what can be determined with certainty from the caption and image.

Output: If a valid entity exists, output its name. Otherwise, output None.

Table 12: Stage 1 prompt for KARR-Bench construction.

Instruction: Given a va...
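The Stage 1 output rule above can be sketched as a simple post-hoc filter. This is an illustrative Python sketch, not the paper's implementation: the category word lists are the examples quoted in the prompt, not the model's full exclusion behavior, and `stage1_filter` is a hypothetical helper name.

```python
# Illustrative sketch of the Stage 1 output rule: an extracted entity is
# kept only if it falls outside the four excluded categories; otherwise
# the stage outputs None. Word lists below are the prompt's examples only.
EXCLUDED = {
    "generic_human": {"man", "woman", "person", "people", "crowd"},
    "background": {"wall", "floor", "sky", "grass", "street", "scene", "view"},
    "abstract": {"performance", "activity", "event", "situation"},
    "ambiguous": {"object", "thing", "shape", "line"},
}

def stage1_filter(entity):
    """Return the entity name if it is a valid, specific entity; else None."""
    if entity is None:
        return None
    name = entity.strip().lower()
    if any(name in words for words in EXCLUDED.values()):
        return None
    return entity

print(stage1_filter("Eiffel Tower"))  # kept: Eiffel Tower
print(stage1_filter("person"))        # excluded -> None
```

In the actual pipeline this decision is made by the prompted model itself; the sketch only makes the accept/reject contract explicit.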

• Do not include the entity name or any of its morphological variants

• Do not use direct synonyms or trivial paraphrases

• Avoid describing visual appearance (e.g., color or shape) unless required by the reasoning itself.

Rejection Rule: If the entity cannot be described using stable and widely accepted knowledge, output CANNOT_REWRITE.

Table 13: Stage 2 prompt for KARR-Bench construction.

G Prompts for KARR-Bench Construction

gpt-5-mini-2025-08-07

To ensure the reproducibility of our KARR-Bench, we...
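The Stage 2 constraints above lend themselves to a mechanical validity check. The following is a minimal sketch under stated assumptions, not the paper's code: `variants` is a crude suffix-based stand-in for "morphological variants" (the prompt leaves variant detection to the model, and synonym checks are omitted here), and `validate_rewrite` is a hypothetical helper name.

```python
# Illustrative check for the Stage 2 constraints: a rewritten description
# must not mention the entity name or simple morphological variants, and
# an unrewritable entity is mapped to the CANNOT_REWRITE sentinel from
# Table 13. Variant generation is a crude suffix heuristic, for sketch only.
def variants(entity):
    base = entity.lower()
    return {base, base + "s", base + "es", base.rstrip("s")}

def validate_rewrite(entity, rewrite):
    """Return the rewrite if it passes the constraints, else CANNOT_REWRITE."""
    if rewrite is None:
        return "CANNOT_REWRITE"
    words = set(rewrite.lower().split())
    if words & variants(entity):
        return "CANNOT_REWRITE"
    return rewrite

print(validate_rewrite("violin", "a bowed string instrument held under the chin"))
print(validate_rewrite("violin", "a violin used in orchestras"))  # leaks the name
```

As with Stage 1, the real constraint is enforced inside the prompt; a check like this would only serve as an automatic guardrail on the model's output.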