Recognition: 2 theorem links · Lean Theorem
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3
The pith
A small set of shared latent queries appended to frozen MLLMs aggregates text and image tokens into unified retrieval embeddings without any backbone updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SLQ adapts MLLMs for retrieval by appending a small set of shared latent queries to both text and image tokens, then relying on the model's native causal attention to aggregate multimodal context into a single embedding, all while the backbone remains completely frozen.
What carries the argument
Shared Latent Queries: a trainable but tiny set of vectors appended to the input sequence so that causal attention pools text and vision features into one retrieval vector.
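To make the mechanism concrete, here is a minimal PyTorch-style sketch of the idea as described above: a handful of trainable query vectors appended after the frozen backbone's image and text token embeddings, with the final hidden states at the query positions pooled into a single normalized retrieval embedding. The class and argument names (SLQPooler, num_queries, an inputs_embeds-style forward returning last_hidden_state) are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the shared-latent-query idea, assuming a decoder-only
# multimodal backbone that accepts precomputed input embeddings (HF-style).
import torch
import torch.nn as nn


class SLQPooler(nn.Module):
    """Appends shared latent queries to frozen-backbone token embeddings and
    pools the query positions into one retrieval embedding."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_queries: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # the backbone stays entirely frozen
            p.requires_grad_(False)
        # The only trainable parameters: a tiny set of shared latent queries.
        self.latent_queries = nn.Parameter(torch.randn(num_queries, hidden_size) * 0.02)

    def forward(self, token_embeds: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, L, H) embeddings of the image and/or text tokens.
        bsz = token_embeds.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(bsz, -1, -1)
        inputs = torch.cat([token_embeds, queries], dim=1)  # append queries last
        mask = torch.cat(
            [attention_mask, attention_mask.new_ones(bsz, queries.size(1))], dim=1
        )
        hidden = self.backbone(inputs_embeds=inputs, attention_mask=mask).last_hidden_state
        # Under causal attention the appended queries can read every earlier token;
        # mean-pool their final hidden states into one embedding.
        pooled = hidden[:, -queries.size(1):, :].mean(dim=1)
        return nn.functional.normalize(pooled, dim=-1)
```

Only latent_queries receives gradients in this sketch, which is what makes the adaptation non-invasive in the paper's sense.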
If this is right
- Retrieval performance improves over methods that update model weights, because pre-trained reasoning and world knowledge stay intact.
- The same frozen backbone can be reused for multiple downstream tasks without repeated destructive fine-tuning.
- Knowledge-aware retrieval benchmarks become more informative once adaptation no longer trades away structured knowledge.
- Training cost and memory drop sharply since only the tiny query set is optimized.
Where Pith is reading between the lines
- The same query-append trick may let frozen MLLMs handle other output-head tasks such as captioning or visual question answering with minimal new parameters.
- If the queries truly act as a lightweight aggregator, they could be swapped or extended at inference time to support different retrieval granularities without retraining.
- The approach implies that many current fine-tuning practices for MLLMs may be over-parameterized for tasks that mainly need better input pooling.
Load-bearing premise
The model's existing attention layers can integrate the added queries well enough to produce a unified multimodal embedding without any parameter changes to the original network.
What would settle it
An experiment in which SLQ is run with the shared queries removed or replaced by random vectors and retrieval accuracy falls to the level of the unmodified frozen model on both COCO and KARR-Bench.
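A hedged sketch of that control, assuming an SLQPooler-style model as in the sketch above and a benchmark-supplied evaluation function; evaluate_recall and eval_loader are placeholders, not part of the paper's code.

```python
# Swap the learned queries for random vectors, re-evaluate, and compare.
import torch


@torch.no_grad()
def ablate_queries(model, eval_loader, evaluate_recall):
    """Compare learned vs. random latent queries on the same evaluation set."""
    learned_score = evaluate_recall(model, eval_loader)          # learned queries

    saved = model.latent_queries.data.clone()
    model.latent_queries.data = torch.randn_like(saved) * 0.02   # random replacement
    random_score = evaluate_recall(model, eval_loader)

    model.latent_queries.data = saved                            # restore
    return {"learned_queries": learned_score, "random_queries": random_score}
```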
Original abstract
Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench, validating that preserving the pre-trained representations via non-invasive adaptation is an effective strategy for MLLM-based retrieval. The code is available under: https://github.com/CnFaker/SLQ.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SLQ, a parameter-efficient adaptation method for multimodal large language models (MLLMs) in dense retrieval tasks. It appends a small set of trainable Shared Latent Queries to the frozen backbone's image and text tokens, relying on the model's native causal attention to aggregate multimodal context into unified embeddings without any backbone parameter updates. The reported experiments show SLQ outperforming full fine-tuning and LoRA on COCO and Flickr30K, remaining competitive on MMEB, and delivering substantial gains on the newly introduced KARR-Bench for knowledge-aware reasoning retrieval.
Significance. If the empirical results hold under rigorous controls, the work would be significant for demonstrating that non-invasive adaptation via latent queries can preserve pre-trained MLLM representations and reasoning capabilities better than invasive tuning, while introducing KARR-Bench as a useful new evaluation resource for knowledge-intensive multimodal retrieval.
Major comments (3)
- [§3] §3 (Method): The central claim that appending a small set of shared latent queries to frozen MLLM tokens suffices for cross-modal aggregation via pre-trained causal attention lacks direct verification. No attention map visualizations, ablation on query-token interactions, or analysis of QKV projections applied to the new tokens are provided to rule out the possibility that the queries receive negligible or non-selective attention, which would undermine the non-invasive adaptation argument.
- [§4.2, Table 2] §4.2 and Table 2 (Experiments on COCO/Flickr30K): The reported outperformance over full fine-tuning and LoRA is presented without error bars, multiple random seeds, or statistical significance tests. Given that the soundness assessment notes unverified experimental controls, these gains cannot yet be treated as robust evidence for the superiority of the frozen-backbone approach.
- [§4.3] §4.3 (KARR-Bench): The substantial gains on the new benchmark are load-bearing for the knowledge-aware claim, yet the paper provides insufficient detail on how KARR-Bench differs from standard retrieval sets in terms of query construction, negative sampling, and evaluation metrics, nor does it include ablations isolating the contribution of the shared queries versus other training choices.
Minor comments (2)
- [§3.2] The hyperparameter choice for the number of shared latent queries is mentioned but not accompanied by a sensitivity analysis or ablation table showing performance as a function of query count.
- [§4] Figure captions and axis labels in the experimental section could be expanded to explicitly state what each baseline configuration entails (e.g., whether LoRA rank and target modules match the SLQ setup).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below and will revise the manuscript accordingly to strengthen the empirical support and clarity.
Point-by-point responses
Referee: [§3] §3 (Method): The central claim that appending a small set of shared latent queries to frozen MLLM tokens suffices for cross-modal aggregation via pre-trained causal attention lacks direct verification. No attention map visualizations, ablation on query-token interactions, or analysis of QKV projections applied to the new tokens are provided to rule out the possibility that the queries receive negligible or non-selective attention, which would undermine the non-invasive adaptation argument.
Authors: We agree that direct verification would strengthen the claim. In the revised version, we will add attention map visualizations illustrating how the shared latent queries aggregate information from image and text tokens. We will also include ablations on query count and interactions, plus analysis of QKV projections on the new tokens to demonstrate selective attention. revision: yes
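One hypothetical way such an attention probe could look, assuming a backbone whose forward pass can return per-layer attention weights (HF-style output_attentions=True) and an input layout of image tokens, then text tokens, then the appended queries. This is an illustrative sketch of the kind of check being promised, not the authors' planned analysis.

```python
# Measure how much attention mass the appended latent queries place on image
# vs. text tokens, assuming the layout [image tokens | text tokens | queries].
import torch


@torch.no_grad()
def query_attention_mass(backbone, inputs_embeds, attention_mask,
                         num_queries, num_image_tokens):
    out = backbone(inputs_embeds=inputs_embeds,
                   attention_mask=attention_mask,
                   output_attentions=True)
    # out.attentions: one (B, heads, L, L) tensor per layer; take the last layer
    # and average over heads.
    attn = out.attentions[-1].mean(dim=1)
    query_rows = attn[:, -num_queries:, :]                  # attention from each query
    image_mass = query_rows[..., :num_image_tokens].sum(dim=-1).mean()
    text_mass = query_rows[..., num_image_tokens:-num_queries].sum(dim=-1).mean()
    return {"image": image_mass.item(), "text": text_mass.item()}
```

Near-zero or uniform mass on both modalities would support the referee's concern; selective, input-dependent mass would support the authors' claim.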
Referee: [§4.2, Table 2] §4.2 and Table 2 (Experiments on COCO/Flickr30K): The reported outperformance over full fine-tuning and LoRA is presented without error bars, multiple random seeds, or statistical significance tests. Given that the soundness assessment notes unverified experimental controls, these gains cannot yet be treated as robust evidence for the superiority of the frozen-backbone approach.
Authors: We acknowledge the need for statistical rigor. We will rerun the COCO and Flickr30K experiments with multiple random seeds, report means and standard deviations as error bars in the updated Table 2, and add statistical significance tests (e.g., paired t-tests) to confirm the improvements. revision: yes
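A small sketch of the proposed statistical treatment: collect one score per seed for SLQ and for a matched baseline, report mean and standard deviation, and run a paired t-test across seeds. The Recall@1 numbers in the usage comment are hypothetical.

```python
# Mean, standard deviation, and a paired t-test over matched random seeds.
import numpy as np
from scipy import stats


def compare_over_seeds(slq_scores, baseline_scores):
    slq = np.asarray(slq_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    t_stat, p_value = stats.ttest_rel(slq, base)  # paired: same seeds, two methods
    return {
        "slq_mean": slq.mean(), "slq_std": slq.std(ddof=1),
        "baseline_mean": base.mean(), "baseline_std": base.std(ddof=1),
        "t_statistic": t_stat, "p_value": p_value,
    }


# Hypothetical Recall@1 values from five seeds, SLQ vs. a LoRA baseline:
# compare_over_seeds([61.2, 61.5, 60.9, 61.3, 61.1], [60.4, 60.8, 60.2, 60.6, 60.5])
```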
Referee: [§4.3] §4.3 (KARR-Bench): The substantial gains on the new benchmark are load-bearing for the knowledge-aware claim, yet the paper provides insufficient detail on how KARR-Bench differs from standard retrieval sets in terms of query construction, negative sampling, and evaluation metrics, nor does it include ablations isolating the contribution of the shared queries versus other training choices.
Authors: We will expand the KARR-Bench section with detailed explanations of query construction, negative sampling, and evaluation metrics, highlighting differences from standard sets. We will also add ablations isolating the shared queries' contribution versus other training choices on this benchmark. revision: yes
Circularity Check
No significant circularity; empirical validation independent of method definition
Full rationale
The paper introduces SLQ as a parameter-efficient adaptation technique by appending trainable shared latent queries to frozen MLLM tokens and leveraging native causal attention for multimodal aggregation. This definition is independent of the reported benchmark results. The central claims rest on direct experimental comparisons (outperformance on COCO/Flickr30K, competitive on MMEB, gains on KARR-Bench) rather than any derivation, equation chain, or self-citation that reduces the outcome to fitted inputs or prior author results by construction. No load-bearing step equates a prediction to its own training data or imports uniqueness via self-reference.
Axiom & Free-Parameter Ledger
Free parameters (1)
- number of shared latent queries
Axioms (1)
- Domain assumption: causal attention in the frozen MLLM can effectively aggregate multimodal context from appended shared queries.
Invented entities (1)
- Shared Latent Queries (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem absolute_floor_iff_bare_distinguishability (tag: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: "SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space... only these learnable latent queries while freezing the backbone"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Cited passage: "we optimize only these learnable latent queries while freezing the backbone... contrastive learning objective"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Didi Zhu, et al. LLaVA-OneVision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, et al. Qwen3-VL technical report, 2025.
[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
[6] Anjia Cao, Xing Wei, and Zhiheng Ma. FLAME: Frozen large language models enable data-efficient language-image pre-training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4080–4090, 2025.
[7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
[8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.
[9] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449, 2024.
[10] Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal LLMs. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 2860–2869, 2025.
[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[12] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
[13] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[14] Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-V: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024.
[15] Ziyan Jiang, Rui Meng, Xinyi Yang, Eshed Yumer, Tong Wang, Yinpeng Yue, and Chi Zhang. VLM2Vec: Training vision-language models for massive multimodal embedding tasks. arXiv preprint arXiv:2404.04120, 2024.
[16] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023.
[17] P. Langley. Crafting papers on machine learning. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pages 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
[18] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
[19] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[20] Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.
[21] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
[22] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM-Embed: Universal multimodal retrieval with multimodal LLMs. arXiv preprint arXiv:2411.02571, 2024.
[23] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] Bangwei Liu, Yicheng Bao, Shaohui Lin, Xuhong Wang, Xin Tan, Yingchun Wang, Yuan Xie, and Chaochao Lu. IDMR: Towards instance-driven precise visual correspondence in multimodal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6320–6329, 2025.
[25] Haitao Liu, Chunshan Xu, and Junying Liang. Dependency distance: A new perspective on syntactic patterns in natural languages. Physics of Life Reviews, 21:171–193, 2017.
[26] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
[27] Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences, 67(12):220102, 2024.
[28] Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, and Stephen Gould. Image retrieval on real-life images with pre-trained vision-and-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2125–2134, 2021.
[29] Haoran Lou, Chunxiao Fan, Ziyan Liu, Yuexin Wu, and Xinliang Wang. LLaVA-SP: Enhancing visual representation with visual spatial tokens for MLLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22014–22024, 2025.
[30] Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with MetaQueries. arXiv preprint arXiv:2504.06256, 2025.
[31] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
[32] Leigang Qu, Haochuan Li, Tan Wang, Wenjie Wang, Yongqi Li, Liqiang Nie, and Tat-Seng Chua. TIGeR: Unifying text-to-image generation and retrieval with large multimodal models. arXiv preprint arXiv:2406.05814, 2024.
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML 2021), pages 8748–8763, 2021.
[34] Marina Solnyshkina, Radif Zamaletdinov, Ludmila Gorodetskaya, and Azat Gabitov. Evaluating text complexity and Flesch-Kincaid grade level. Journal of Social Studies Education Research, 8(3):238–248, 2017.
[35] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5227–5237, 2022.
[36] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
[37] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
[38] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
[39] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[40] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021.
[41] Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv preprint arXiv:2410.10594, 2024.
[42] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
[43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
[44] Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. MagicLens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651, 2024.
[45] Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. GME: Improving universal multimodal retrieval by multimodal LLMs. arXiv preprint arXiv:2412.16855, 2024.
[46] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.
[47] Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. MegaPairs: Massive data synthesis for universal multimodal retrieval. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19076–19095, 2025.
[48] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16816–16825, 2022.
[49] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
[50] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.