pith. machine review for the scientific record.

arxiv: 2604.03231 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · multi-encoder fusion · contrastive learning · self-supervised learning · visual grounding · token fusion · cross-attention · DINO encoder

The pith

Fusing a contrastive vision encoder with a self-supervised DINO encoder via targeted aggregation and cross-attention produces better visual tokens for decoder-only language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how to combine two different kinds of vision encoders that have complementary strengths: one trained with image-text contrastive objectives for alignment and retrieval, and another trained self-supervised for denser semantic features. It introduces CoME-VL, a modular fusion approach that first aggregates multi-layer features using entropy guidance and orthogonality constraints to cut redundancy, then applies RoPE-enhanced cross-attention to align the resulting token grids into compact representations. These fused tokens plug into existing decoder-only LLM pipelines with almost no architectural change. Experiments show the combined system beats single-encoder baselines by 4.9 percent on average for visual understanding tasks and 5.4 percent on grounding tasks, reaching state-of-the-art on RefCOCO detection.
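As a rough illustration of the kind of fusion module this summary describes, the sketch below projects two encoders' token grids into a shared width and compresses them into a fixed budget of fused visual tokens via cross-attention from learnable queries. The dimensions, the learnable-query design, and the omission of the paper's RoPE positional terms and multi-layer aggregation are all simplifying assumptions, not the authors' implementation.

```python
# Minimal sketch of representation-level fusion between two vision encoders.
# Module names and dimensions are hypothetical; RoPE is omitted for brevity.
import torch
import torch.nn as nn

class TwoEncoderFusion(nn.Module):
    def __init__(self, dim_contrastive=1152, dim_ssl=1024, dim_llm=4096,
                 num_fused_tokens=256, num_heads=8):
        super().__init__()
        # Project both token grids into a shared fusion width.
        self.proj_contrastive = nn.Linear(dim_contrastive, dim_llm)
        self.proj_ssl = nn.Linear(dim_ssl, dim_llm)
        # Learnable queries define the compact fused-token budget.
        self.queries = nn.Parameter(torch.randn(num_fused_tokens, dim_llm) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim_llm, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim_llm)

    def forward(self, tokens_contrastive, tokens_ssl):
        # tokens_contrastive: (B, N1, dim_contrastive), e.g. SigLIP-style patches
        # tokens_ssl:         (B, N2, dim_ssl), e.g. DINO-style patches
        kv = torch.cat([self.proj_contrastive(tokens_contrastive),
                        self.proj_ssl(tokens_ssl)], dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)   # (B, num_fused_tokens, dim_llm)
        return self.norm(fused)                 # ready to prepend to the LLM input

# Example with random features standing in for real encoder outputs.
fusion = TwoEncoderFusion()
fused_tokens = fusion(torch.randn(2, 729, 1152), torch.randn(2, 576, 1024))
print(fused_tokens.shape)  # torch.Size([2, 256, 4096])
```

Using a fixed set of learnable queries is one common way to get a compact token budget; the paper's cross-attention could equally use one encoder's tokens as queries over the other's, which this sketch does not attempt to reproduce.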

Core claim

CoME-VL performs representation-level fusion of a contrastively pretrained vision encoder and a self-supervised DINO encoder by applying entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, followed by RoPE-enhanced cross-attention to align heterogeneous token grids and generate compact fused visual tokens that can be injected into a decoder-only LLM.

What carries the argument

The modular fusion framework that integrates contrastive and self-supervised vision encoders through entropy-guided multi-layer aggregation, orthogonality-constrained projections, and RoPE-enhanced cross-attention to produce compact, non-redundant visual tokens.
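One plausible reading of the entropy-guided aggregation step, sketched under stated assumptions: each candidate layer is scored by the mean entropy of its token feature distributions, and layers are blended with softmax weights derived from those scores. The paper's exact scoring rule and the orthogonality-constrained projections are not reproduced here.

```python
# Illustrative entropy-guided multi-layer aggregation; the scoring rule is an
# assumption, not the paper's formulation.
import torch

def layer_entropy(features: torch.Tensor) -> torch.Tensor:
    """features: (B, N, D) tokens from one layer -> scalar mean entropy."""
    p = features.softmax(dim=-1)                      # treat channels as a distribution
    ent = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)  # (B, N) per-token entropy
    return ent.mean()

def aggregate_layers(layer_features: list[torch.Tensor], temperature: float = 1.0):
    """Weighted sum of layers, with weights proportional to their token entropy."""
    scores = torch.stack([layer_entropy(f) for f in layer_features])
    weights = (scores / temperature).softmax(dim=0)   # (L,)
    stacked = torch.stack(layer_features, dim=0)      # (L, B, N, D)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0), weights

layers = [torch.randn(2, 576, 1024) for _ in range(4)]  # stand-in for 4 hidden layers
agg, w = aggregate_layers(layers)
print(agg.shape, w)  # torch.Size([2, 576, 1024]) plus the 4 layer weights
```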

If this is right

  • The fused visual tokens improve visual understanding tasks by an average of 4.9 percent over single-encoder baselines.
  • Grounding performance rises by an average of 5.4 percent, with state-of-the-art results on RefCOCO detection.
  • The method requires only minimal changes to standard decoder-only VLM pipelines.
  • Ablation studies confirm that both the entropy-guided merging and the non-redundant mixing steps contribute to the observed gains.
  • Complementary contrastive and self-supervised signals together produce stronger representations than either signal alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion recipe could be tested on other pairs of specialized encoders beyond CLIP-style and DINO encoders.
  • If the orthogonality constraint proves robust across scales, it may allow stacking more than two encoders without rapid growth in token redundancy.
  • The approach suggests that future VLMs could leverage off-the-shelf pretrained encoders rather than training ever-larger single vision backbones from scratch.

Load-bearing premise

That entropy-guided aggregation plus orthogonality constraints and RoPE cross-attention will reliably remove redundancy between the two encoders while keeping their complementary information intact.
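A small diagnostic in the spirit of this premise: redundancy between the two encoders' token sets can be tracked as mean absolute cosine similarity in a shared projection space, measured before and after the constrained projections. The function below is illustrative and not a metric reported in the paper.

```python
# Hypothetical redundancy probe: high mean |cosine| between the two token sets
# suggests largely overlapping features; a drop after projection would indicate
# the constraint is doing its job.
import torch
import torch.nn.functional as F

def cross_encoder_redundancy(tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> float:
    """tokens_a: (N1, D), tokens_b: (N2, D), both in a shared projection space."""
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    sim = a @ b.T                      # (N1, N2) pairwise cosine similarities
    return sim.abs().mean().item()

raw = cross_encoder_redundancy(torch.randn(576, 512), torch.randn(729, 512))
print(f"mean |cos| between encoder token sets: {raw:.3f}")
```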

What would settle it

Running the same benchmarks with the fused tokens replaced by either encoder alone and observing no consistent gain in accuracy or grounding metrics would falsify the claim that the fusion step is responsible for the reported improvements.

Figures

Figures reproduced from arXiv: 2604.03231 by Ankan Deria, Fahad Shahbaz Khan, Hisham Cholakkal, Imran Razzak, Komal Kumar, Salman Khan, Xilin He.

Figure 1: CoME-VL uses token entropy analysis to identify complementary layer ranges from multiple vision encoders (SigLIP2 …)
Figure 2: Semantic feature analysis. (a) Layer-wise comparison of spatial attention in DINOv3 and SigLIP2. We visualize attention …
Figure 3: Overview of the proposed multi-encoder, multi-scale vision-language framework. Images are processed by two com…
Figure 4: Qualitative results on PixMo pointing. Compared to prior VLMs, CoME-VL demonstrates more precise coordinate-level …
Figure 5: Qualitative examples of CoME-VL on chart understanding, document/table reasoning, localization, pointing, and …
Figure 6: Component-wise contribution analysis on PixMo …
Figure 7: Layer-wise attention rollout for deeper layers: DINOv3, illustrating the transition from spatially coherent object-level …
Figure 8: Layer-wise attention rollout for deeper layers: SigLIP2, illustrating the transition from spatially coherent object-level …
Figure 9: Layer-wise attention rollout for deeper layers: DINOv3, illustrating the transition from spatially coherent object-level …
Figure 10: Layer-wise attention rollout for deeper layers: SigLIP2, illustrating the transition from spatially coherent object-level …
(Captions truncated in extraction; the full figures are available in the arXiv source.)
Original abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CoME-VL, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder for vision-language models. It performs representation-level fusion via entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, followed by RoPE-enhanced cross-attention to align heterogeneous token grids. The central claim is that this yields consistent outperformance over single-encoder baselines, with average gains of 4.9% on visual understanding tasks and 5.4% on grounding tasks, plus SOTA results on RefCOCO detection.

Significance. If the gains can be isolated to the entropy-guided aggregation and RoPE fusion rather than dual-encoder capacity, the work would provide a practical, modular route to exploit complementary contrastive and self-supervised visual signals in decoder-only VLMs, with potential benefits for robustness on dense understanding and grounding benchmarks.

major comments (2)
  1. [Ablation studies] Ablation studies section: The reported ablations on layer merging, non-redundant feature mixing, and fusion capacity do not include a control that simply concatenates or averages the two encoder outputs (under identical LLM backbone, training regime, and token budget). Without this baseline, the 4.9% / 5.4% margins cannot be attributed to the entropy-guided aggregation or RoPE cross-attention rather than the mere addition of a second visual encoder.
  2. [Method] Method section on orthogonality-constrained projections: The paper does not provide the explicit loss term or projection matrix formulation used to enforce orthogonality (e.g., no equation analogous to a Frobenius-norm penalty on off-diagonal elements). This detail is load-bearing for the claim that redundancy is reduced while preserving complementary information.

minor comments (2)
  1. [Abstract] Abstract: The phrase 'state-of-the-art performance on RefCOCO for detection' should specify the exact metric (e.g., Acc@0.5) and the prior SOTA reference for direct comparison.
  2. [Figures] Figure captions: Several figures comparing token visualizations lack quantitative metrics (e.g., cosine similarity or entropy values) that would allow readers to verify the claimed reduction in redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We appreciate the feedback on the ablation studies and the method description. We will revise the manuscript to address these points by adding the requested baseline and providing the explicit formulation for the orthogonality constraint.

Point-by-point responses
  1. Referee: Ablation studies section: The reported ablations on layer merging, non-redundant feature mixing, and fusion capacity do not include a control that simply concatenates or averages the two encoder outputs (under identical LLM backbone, training regime, and token budget). Without this baseline, the 4.9% / 5.4% margins cannot be attributed to the entropy-guided aggregation or RoPE cross-attention rather than the mere addition of a second visual encoder.

    Authors: We agree that including a simple concatenation or averaging baseline is crucial to isolate the contributions of our proposed entropy-guided aggregation and RoPE cross-attention. In the revised manuscript, we will add this control experiment, maintaining identical LLM backbone, training regime, and token budget. This will allow us to better attribute the observed performance gains to the specific fusion mechanisms. revision: yes

  2. Referee: Method section on orthogonality-constrained projections: The paper does not provide the explicit loss term or projection matrix formulation used to enforce orthogonality (e.g., no equation analogous to a Frobenius-norm penalty on off-diagonal elements). This detail is load-bearing for the claim that redundancy is reduced while preserving complementary information.

    Authors: We thank the referee for pointing this out. The orthogonality constraint is implemented using a Frobenius norm penalty on the off-diagonal elements of the projection matrix product, specifically L_ortho = ||W^T W - I||_F^2, where W is the projection matrix. We will include this explicit loss term and the full formulation in the revised Method section to clarify how redundancy is reduced. revision: yes
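For concreteness, the penalty stated in the simulated response above, L_ortho = ||W^T W - I||_F^2, can be written out directly. How it is weighted against the main training objective is not specified here; the sketch is illustrative.

```python
# Squared-Frobenius-norm orthogonality penalty on a projection matrix W,
# matching the formula quoted in the response above.
import torch

def orthogonality_penalty(W: torch.Tensor) -> torch.Tensor:
    """W: (d_in, d_out) projection weights -> scalar penalty ||W^T W - I||_F^2."""
    d_out = W.shape[1]
    gram = W.T @ W                                            # (d_out, d_out)
    eye = torch.eye(d_out, device=W.device, dtype=W.dtype)
    return ((gram - eye) ** 2).sum()

W = torch.randn(1024, 512) / 1024 ** 0.5
print(orthogonality_penalty(W))  # added to the training loss with some weight
```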

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper proposes an empirical modular fusion framework for integrating contrastive and self-supervised vision encoders into VLMs, using components such as entropy-guided multi-layer aggregation, orthogonality-constrained projections, and RoPE-enhanced cross-attention. All central claims rest on experimental results and ablations measured against external benchmarks (e.g., 4.9% average gain on visual understanding tasks, RefCOCO SOTA), with no mathematical derivation, first-principles prediction, or fitted parameter that reduces to its own inputs by construction. The architecture is presented as a practical design choice validated externally rather than a self-referential theorem or renamed empirical pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only, no explicit free parameters, axioms, or invented entities are stated; the method relies on standard contrastive and self-supervised pretraining assumptions from prior work.

pith-pipeline@v0.9.0 · 5585 in / 1078 out tokens · 31328 ms · 2026-05-13T20:24:36.463376+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
