Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Helen Suzuki; Wei-Yao Wang; Yoshiyuki Kobayashi; Zhao Wang

arxiv: 2503.02597 · v3 · pith:UFLH42CJnew · submitted 2025-03-04 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-23 01:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: multimodal large language models, causal attention, vision-language misalignment, modality-mutual attention, multimodal understanding benchmarks, decoder-only architectures


The pith

Unlocking causal attention into modality-mutual attention lets image tokens attend to text tokens and raises multimodal LLM performance on twelve benchmarks by 6.2 percent on average with no added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language misalignment arises because decoder-only LLMs use causal attention, which blocks image tokens from incorporating information carried by later text tokens. Replacing the causal mask with modality-mutual attention removes this restriction while keeping intra-modality causality intact. The resulting architecture is tested on three different LLM backbones and delivers state-of-the-art scores across twelve multimodal understanding benchmarks without introducing any new trainable parameters. The design is presented as generic enough to apply to other modality pairs and larger multimodal settings.

Core claim

The paper establishes that converting the standard causal attention mask into modality-mutual attention enables image tokens to attend directly to text tokens, thereby reducing vision-language misalignment and producing higher accuracy on multimodal tasks across multiple model backbones without any increase in parameter count.

What carries the argument

Modality-mutual attention (MMA), a modified attention mask that permits image tokens to attend to text tokens while preserving the original causal ordering within each modality.
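
A minimal sketch of how such a mask could be built, assuming a single sequence whose tokens are labelled by modality. The function name `modality_mutual_mask` and the string labels are illustrative assumptions, not the paper's code. The rule used here (image-token rows additionally see every text-token column, everything else stays lower-triangular) is one reading of the description above.

```python
# Hypothetical modality labels, chosen for illustration only.
IMAGE, TEXT = "image", "text"

def modality_mutual_mask(modalities):
    """Return an n x n boolean mask; mask[q][k] is True if query q may attend to key k."""
    n = len(modalities)
    # Start from the standard causal (lower-triangular) mask.
    mask = [[k <= q for k in range(n)] for q in range(n)]
    # Unlock only the image-to-text direction: image queries may see all text keys.
    for q in range(n):
        for k in range(n):
            if modalities[q] == IMAGE and modalities[k] == TEXT:
                mask[q][k] = True
    return mask

mask = modality_mutual_mask([IMAGE, IMAGE, TEXT, TEXT])
for row in mask:
    print([int(v) for v in row])
# [1, 0, 1, 1]   image token 0 sees itself and both text tokens, not image token 1
# [1, 1, 1, 1]   image token 1 sees earlier image token, itself, and both text tokens
# [1, 1, 1, 0]   text token 2 stays causal
# [1, 1, 1, 1]   text token 3 stays causal
```

In this sketch the image-to-image and text-to-text blocks remain lower-triangular, which is what "preserving the original causal ordering within each modality" would mean for an image-then-text input.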

If this is right

MMA raises average performance by 6.2 percent across twelve multimodal understanding benchmarks on three different LLM backbones.
The change requires no additional parameters.
The same attention modification can be applied to other pairs of modalities.
The approach scales to a range of multimodal input scenarios without architectural redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the attention mask change is the decisive factor, comparable mask adjustments could be tested in models that process audio or video before text.
The reported gains may depend on the specific instruction-tuning data; repeating the experiments on held-out multimodal datasets would test robustness.
Extending the mutual attention pattern to three or more modalities at once remains an open implementation question left by the paper.

Load-bearing premise

Vision-language misalignment is caused mainly by image tokens being unable to see subsequent text tokens under causal attention.

What would settle it

An ablation that enables image-to-text attention yet shows no gain or a drop in factual alignment on the same twelve benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2503.02597 by Helen Suzuki, Wei-Yao Wang, Yoshiyuki Kobayashi, Zhao Wang.

**Figure 2.** The conventional framework for MLLMs (e.g., Molmo […

**Figure 3.** An illustration for dual-order training, where T and I indicate text and images, respectively.

**Figure 4.** The prompt template for the I&T and T&I input orders. {image patch} and {question} are …

**Figure 5.** An illustration for our proposed modality-mutual attention (MMA), which modifies the …

**Figure 6.** Illustrations sampled from the Blip3-kale dataset.

**Figure 7.** Illustrations sampled from the Blip3-OCR dataset.

**Figure 8.** Qualitative comparisons among the conventional training pipeline ((I&T) …

**Figure 9.** Qualitative comparisons among the conventional training pipeline ((I&T) …

**Figure 10.** Qualitative comparisons among the conventional training pipeline ((I&T) …

**Figure 11.** Qualitative comparisons among the conventional training pipeline ((I&T) …

Original abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are built on decoder-only LLMs with a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the later modalities (e.g., text). To address this problem, we introduce an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA), enabling image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance on 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MMA is a simple attention-mask change claiming 6.2% multimodal gains with no added params, but the experiments do not isolate whether the mask is what drives the result.


This paper's main takeaway is a parameter-free tweak to attention in MLLMs that lets image tokens attend to text tokens, claiming a 6.2% average boost on 12 benchmarks across three backbones. The change turns causal attention into modality-mutual attention to reduce vision-language misalignment. What is new is treating the causal mask itself as the fixable source of misalignment rather than adding connectors or more tuning data. It does well by staying simple and reporting consistent gains without increasing parameters, which makes it easy to try.

The soft spots are around the evidence for the mechanism. The description does not include ablations that hold everything else constant and only flip the attention direction between modalities. That makes it difficult to attribute the gains directly to MMA instead of implementation details or other changes. There is also no mention of checking whether text-only performance remains stable, which would help confirm that no new issues are introduced. The premise that causal attention is the main culprit for misalignment is plausible but not strongly isolated in the reported work. The numbers are the main support.

This paper is for people experimenting with MLLM architectures who want minimal changes. A reader focused on efficient improvements to multimodal models would get something out of the benchmark results and the mask design. It deserves peer review. The idea is direct and the improvements are large enough to warrant a closer look, even if more targeted experiments would help.

Referee Report

2 major / 1 minor

Summary. The paper claims that vision-language misalignment in MLLMs arises primarily because decoder-only LLMs use causal attention, which prevents earlier image tokens from attending to later text tokens. It proposes modality-mutual attention (MMA) by unlocking the causal mask to enable bidirectional cross-modal attention. This change is reported to yield state-of-the-art results on 12 multimodal understanding benchmarks (+6.2% average across three LLM backbones) with no added parameters; the design is presented as generic and scalable to other modalities.

Significance. If the performance gains can be robustly attributed to MMA via isolating controls, the result would be significant: a parameter-free architectural modification to the core attention mechanism that improves factual alignment in MLLMs. The absence of extra parameters and the claim of applicability across backbones are clear strengths. The work would encourage re-examination of causal masking assumptions in multimodal decoder-only models.

major comments (2)

[Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.
[Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.

minor comments (1)

[Abstract] The abstract states results on '12 multimodal understanding benchmarks' but does not enumerate them or the three LLM backbones; adding this list would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The concerns about isolating the contribution of the attention-mask change and verifying language-only performance are valid for strengthening attribution. We respond to each point below.


Referee: [Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.

Authors: We agree that a more granular ablation isolating only the image-to-text direction would strengthen the causal attribution. Our reported results compare baseline causal-attention models against MMA versions under identical training data, optimizer, learning rate schedule, and all other hyperparameters, with the sole change being the attention mask that enables image tokens to attend to subsequent text tokens. To directly address the referee's request, we will add a dedicated ablation table in the revised manuscript that holds every factor fixed and toggles solely the image-to-text attention direction (while preserving text-to-image causality) to quantify its isolated contribution to the observed gains. revision: yes
Referee: [Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.

Authors: When the input sequence contains only text tokens, the MMA mask is identical to the original causal mask because no image tokens exist to create cross-modal interactions. Consequently, language-only behavior is unchanged by construction. Nevertheless, to make this explicit and address the referee's concern, we will include language-only benchmark results (e.g., on standard text-only suites) in the revised experimental section to empirically confirm that no degradation or new misalignment is introduced. revision: yes
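
The "unchanged by construction" argument in the second response can be checked at the level of the mask itself. The sketch below assumes a mask rule in which only image queries gain access to text keys; `causal_mask` and `mma_mask` are hypothetical helper names, not functions from the paper. It only shows that the two masks coincide for text-only input, and says nothing about the language-only accuracy of a model trained with the new mask.

```python
def causal_mask(n):
    """Standard lower-triangular mask: query q sees keys 0..q."""
    return [[k <= q for k in range(n)] for q in range(n)]

def mma_mask(modalities):
    """Causal mask plus image-query to text-key access (assumed MMA rule)."""
    n = len(modalities)
    return [
        [k <= q or (modalities[q] == "image" and modalities[k] == "text")
         for k in range(n)]
        for q in range(n)
    ]

# With no image tokens, no entry can be unlocked, so the masks are identical.
text_only = ["text"] * 5
print(mma_mask(text_only) == causal_mask(5))   # True

# With image tokens before text, the masks differ.
mixed = ["image", "image", "text"]
print(mma_mask(mixed) == causal_mask(3))       # False
```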

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on reported results, not self-referential equations or fits


The paper proposes replacing causal attention with modality-mutual attention (MMA) to allow image tokens to attend to text tokens, claiming this resolves vision-language misalignment and yields +6.2% average gains on 12 benchmarks across 3 backbones without added parameters. No equations, parameter fits, or derivations appear in the provided text. The central claim is an architectural modification validated by external benchmark numbers rather than any quantity that reduces to its own inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or renamed empirical patterns are invoked as load-bearing steps. The derivation chain is therefore self-contained against the reported empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only. The central claim rests on the effectiveness of the MMA design, which is introduced as a new mechanism without independent verification beyond the stated benchmark gains. No free parameters, axioms, or invented entities beyond the MMA mechanism itself are described.

invented entities (1)

modality-mutual attention (MMA): no independent evidence
purpose: to allow image tokens to attend to text tokens by modifying causal attention
New attention design proposed to address misalignment; no independent evidence provided outside the paper's claims.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RAVE: Re-Allocating Visual Attention in Large Multimodal Models
cs.CV · 2026-05 · unverdicted · novelty 5.0

RAVE is a lightweight pair-gating mechanism that adds a learned bias to pre-softmax attention over visual keys in LMMs, yielding an average 3-point gain on multimodal benchmarks with larger improvements on perception tasks.

Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
cs.CV · 2026-04 · unverdicted · novelty 5.0

An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 2 Pith papers · 14 internal anchors

[1]

Abdin, M.I., Jacobs, S.A., Awan, A.A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H.S., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Mendes, C.C.T., Chen, W., Chaudhary, V ., Chopra, P., Giorno, A.D., de Rosa, G., Dixon, M., Eldan, R., Iter, D., Garg, A., Goswami, A., Gunasekar, S., Haider, E., Hao, J....

work page
[2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Phi-3 technical report: A highly capable language model locally on your phone. CoRR abs/2404.14219. 10

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Flamingo: a visual language model for few-shot learning, in: NeurIPS

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J.L., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K., 2022. Flam...

work page 2022
[4]

Gemini: A Family of Highly Capable Multimodal Models

Anil, R., Borgeaud, S., Wu, Y ., Alayrac, J., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., Silver, D., Petrov, S., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., Lillicrap, T.P., Lazaridou, A., Firat, O., Molloy, J., Isard, M., Barham, P.R., Hennigan, T., Lee, B., Viola, F., Reynolds, M., Xu, Y...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Claude 3.5 sonnet

Anthropic, 2024. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

work page 2024
[6]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y ., Zhu, W., Marathe, K., Bitton, Y ., Gadre, S.Y ., Sagawa, S., Jitsev, J., Kornblith, S., Koh, P.W., Ilharco, G., Wortsman, M., Schmidt, L., 2023. Openflamingo: An open-source framework for training large autoregressive vision- language models. CoRR abs/2308.01390

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

BLIP3-KALE: knowledge augmented large-scale dense captions

Awadalla, A., Xue, L., Shu, M., Yan, A., Wang, J., Purushwalkam, S., Shen, S., Lee, H., Lo, O., Park, J.S., Guha, E., Savarese, S., Schmidt, L., Choi, Y ., Xiong, C., Xu, R., 2024. BLIP3-KALE: knowledge augmented large-scale dense captions. CoRR abs/2411.07461

work page arXiv 2024
[8]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J., 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. CoRR abs/2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

The revolution of multimodal large language models: A survey, in: ACL (Findings), Association for Computational Linguistics

Caffagni, D., Cocchi, F., Barsellotti, L., Moratelli, N., Sarto, S., Baraldi, L., Cornia, M., Cucchiara, R., 2024. The revolution of multimodal large language models: A survey, in: ACL (Findings), Association for Computational Linguistics. pp. 13590–13618

work page 2024
[10]

Honeybee: Locality-enhanced projector for multimodal LLM, in: CVPR, IEEE

Cha, J., Kang, W., Mun, J., Roh, B., 2024. Honeybee: Locality-enhanced projector for multimodal LLM, in: CVPR, IEEE. pp. 13817–13827

work page 2024
[11]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y ., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., Lu, J., Anderson, T., Bransom, E., Ehsani, K., Ngo, H., Chen, Y ., Patel, A., Yatskar, M., Callison-Burch, C., Head, A., Hendrix, R., Bastani, F., VanderBilt, E., Lambert, N., Chou, Y ., Chheda, A., Sparks, J., Skjonsberg, S., Schmitz, M...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Scalable vision language model training via high quality data curation

Dong, H., Kang, Z., Yin, W., Liang, X., Feng, C., Ran, J., 2025. Scalable vision language model training via high quality data curation. arXiv preprint arXiv:2501.05952

work page arXiv 2025
[13]

Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, in: Proceedings of the 32nd ACM International Conference on Multimedia, pp

Duan, H., Yang, J., Qiao, Y ., Fang, X., Chen, L., Liu, Y ., Dong, X., Zang, Y ., Zhang, P., Wang, J., et al., 2024. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models, in: Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11198–11201

work page 2024
[14]

Multimodal autoregressive pre-training of large vision encoders

Fini, E., Shukor, M., Li, X., Dufter, P., Klein, M., Haldimann, D., Aitharaju, S., da Costa, V .G.T., Béthune, L., Gan, Z., Toshev, A.T., Eichner, M., Nabi, M., Yang, Y ., Susskind, J.M., El-Nouby, A., 2024. Multimodal autoregressive pre-training of large vision encoders. CoRR abs/2411.14402

work page arXiv 2024
[15]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., Li, K., Sun, X., Ji, R., 2023. MME: A comprehensive evaluation benchmark for multimodal large language models. CoRR abs/2306.13394. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Datacomp: In search of the next generation of multimodal datasets, in: NeurIPS

Gadre, S.Y ., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., Orgad, E., Entezari, R., Daras, G., Pratt, S.M., Ramanujan, V ., Bitton, Y ., Marathe, K., Mussmann, S., Vencu, R., Cherti, M., Krishna, R., Koh, P.W., Saukh, O., Ratner, A.J., Song, S., Hajishirzi, H., Farhadi, A., Beaumont, R., Oh, S...

work page 2023
[17]

Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in: CVPR, IEEE Computer Society

Goyal, Y ., Khot, T., Summers-Stay, D., Batra, D., Parikh, D., 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in: CVPR, IEEE Computer Society. pp. 6325–6334

work page 2017
[18]

GQA: A new dataset for real-world visual reasoning and compositional question answering, in: CVPR, Computer Vision Foundation / IEEE

Hudson, D.A., Manning, C.D., 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering, in: CVPR, Computer Vision Foundation / IEEE. pp. 6700–6709

work page 2019
[19]

Referitgame: Referring to objects in photographs of natural scenes, in: EMNLP, ACL

Kazemzadeh, S., Ordonez, V ., Matten, M., Berg, T.L., 2014. Referitgame: Referring to objects in photographs of natural scenes, in: EMNLP, ACL. pp. 787–798

work page 2014
[20]

Visual genome: Connecting language and vision using crowdsourced dense image annotations

Krishna, R., Zhu, Y ., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y ., Li, L., Shamma, D.A., Bernstein, M.S., Fei-Fei, L., 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123, 32–73

work page 2017
[21]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Li, B., Wang, R., Wang, G., Ge, Y ., Ge, Y ., Shan, Y ., 2023a. Seed-bench: Benchmarking multimodal llms with generative comprehension. CoRR abs/2307.16125

work page internal anchor Pith review Pith/arXiv arXiv
[22]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y ., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Li, Y ., Liu, Z., Li, C., 2024. Llava-onevision: Easy visual task transfer. CoRR abs/2408.03326

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models, in: ICML, PMLR

Li, J., Li, D., Savarese, S., Hoi, S.C.H., 2023b. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models, in: ICML, PMLR. pp. 19730–19742

work page
[24]

Evaluating object hallucination in large vision-language models, in: EMNLP, Association for Computational Linguistics

Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W.X., Wen, J., 2023c. Evaluating object hallucination in large vision-language models, in: EMNLP, Association for Computational Linguistics. pp. 292–305

work page
[25]

VILA: on pre-training for visual language models, in: CVPR, IEEE

Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S., 2024. VILA: on pre-training for visual language models, in: CVPR, IEEE. pp. 26679–26689

work page 2024
[26]

Visual spatial reasoning

Liu, F., Emerson, G., Collier, N., 2023a. Visual spatial reasoning. Trans. Assoc. Comput. Linguistics 11, 635–651

work page
[27]

Improved baselines with visual instruction tuning, in: CVPR, IEEE

Liu, H., Li, C., Li, Y ., Lee, Y .J., 2024a. Improved baselines with visual instruction tuning, in: CVPR, IEEE. pp. 26286–26296

work page
[28]

Visual instruction tuning, in: NeurIPS

Liu, H., Li, C., Wu, Q., Lee, Y .J., 2023b. Visual instruction tuning, in: NeurIPS

work page
[29]

A Survey on Hallucination in Large Vision-Language Models

Liu, H., Xue, W., Chen, Y ., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W., 2024b. A survey on hallucination in large vision-language models. CoRR abs/2402.00253

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Mmbench: Is your multi-modal model an all-around player?, in: ECCV (6), Springer

Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W., Yuan, Y ., Wang, J., He, C., Liu, Z., Chen, K., Lin, D., 2024c. Mmbench: Is your multi-modal model an all-around player?, in: ECCV (6), Springer. pp. 216–233

work page
[31]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Sun, Y ., Deng, C., Xu, H., Xie, Z., Ruan, C., 2024a. Deepseek-vl: Towards real-world vision-language understanding. arXiv:2403.05525

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, in: ICLR, OpenReview.net

Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K., Galley, M., Gao, J., 2024b. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, in: ICLR, OpenReview.net

work page
[33]

Learn to explain: Multimodal reasoning via thought chains for science question answering, in: NeurIPS

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K., Zhu, S., Tafjord, O., Clark, P., Kalyan, A., 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering, in: NeurIPS. 12

work page 2022
[34]

Generation and comprehension of unambiguous object descriptions, in: CVPR, IEEE Computer Society

Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K., 2016. Generation and comprehension of unambiguous object descriptions, in: CVPR, IEEE Computer Society. pp. 11–20

work page 2016
[35]

MM1: methods, analysis and insights from multimodal LLM pre-training, in: ECCV (29), Springer

McKinzie, B., Gan, Z., Fauconnier, J., Dodge, S., Zhang, B., Dufter, P., Shah, D., Du, X., Peng, F., Belyi, A., Zhang, H., Singh, K., Kang, D., Hè, H., Schwarzer, M., Gunter, T., Kong, X., Zhang, A., Wang, J., Wang, C., Du, N., Lei, T., Wiseman, S., Lee, M., Wang, Z., Pang, R., Grasch, P., Toshev, A., Yang, Y ., 2024. MM1: methods, analysis and insights f...

work page 2024
[36]

OCR-VQA: visual question answering by reading text in images, in: ICDAR, IEEE

Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A., 2019. OCR-VQA: visual question answering by reading text in images, in: ICDAR, IEEE. pp. 947–952

work page 2019
[37]

A primer on sports analytics: A new dimension of sports

OpenAI, 2024. A primer on sports analytics: A new dimension of sports. https://openai.com/index/hello-gpt-4o/

work page 2024
[38]

How do you like los angeles’ new parking signs? URL: https://www.npr.org/sections/thetwo-way/2015/04/06/397858800/ how-do-you-like-los-angeles-new-parking-signs

Sanders, S., 2015. How do you like los angeles’ new parking signs? URL: https://www.npr.org/sections/thetwo-way/2015/04/06/397858800/ how-do-you-like-los-angeles-new-parking-signs

work page 2015
[39]

A-OKVQA: A benchmark for visual question answering using world knowledge, in: ECCV (8), Springer

Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R., 2022. A-OKVQA: A benchmark for visual question answering using world knowledge, in: ECCV (8), Springer. pp. 146–162

work page 2022
[40]

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems

Tong, S., II, E.L.B., Wu, P., Woo, S., IYER, A.J., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., Pan, X., Fergus, R., LeCun, Y ., Xie, S., 2024. Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems. URL: https://openreview.net/forum?id= Vi8AepAXGy

work page 2024
[41]

Attention is all you need, in: NIPS, pp

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., 2017. Attention is all you need, in: NIPS, pp. 5998–6008

work page 2017
[42]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y ., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J., 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. CoRR abs/2409.12191

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y ., Wu, C., Wang, B., Xie, Z., Wu, Y ., Hu, K., Wang, J., Sun, Y ., Li, Y ., Piao, Y ., Guan, K., Liu, A., Xie, X., You, Y ., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y ., Ruan, C., 2024. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. UR...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Realworldqa

xAI, 2024. Realworldqa. URL: https://x.ai/blog/grok-1.5v

work page 2024
[45]

Mitigating object hallucination via concentric causal attention, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems

Xing, Y ., Li, Y ., Laptev, I., Lu, S., 2024. Mitigating object hallucination via concentric causal attention, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems. URL: https://openreview.net/forum?id=CIRPE1bSmV

work page 2024
[46]

xgen-mm (blip-3): A family of open large multimodal models

Xue, L., Shu, M., Awadalla, A., Wang, J., Yan, A., Purushwalkam, S., Zhou, H., Prabhu, V ., Dai, Y ., Ryoo, M.S., Kendre, S., Zhang, J., Qin, C., Zhang, S., Chen, C., Yu, N., Tan, J., Awalgaonkar, T.M., Heinecke, S., Wang, H., Choi, Y ., Schmidt, L., Chen, Z., Savarese, S., Niebles, J.C., Xiong, C., Xu, R., 2024. xgen-mm (BLIP-3): A family of open large m...

work page arXiv 2024
[47]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yao, Y ., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., Chen, Q., Zhou, H., Zou, Z., Zhang, H., Hu, S., Zheng, Z., Zhou, J., Cai, J., Han, X., Zeng, G., Li, D., Liu, Z., Sun, M., 2024. Minicpm-v: A GPT-4V level MLLM on your phone. CoRR abs/2408.01800

work page internal anchor Pith review Pith/arXiv arXiv 2024
[48]

A Survey on Multimodal Large Language Models

Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E., 2023. A survey on multimodal large language models. CoRR abs/2306.13549. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Ferret: Refer and ground anything anywhere at any granularity, in: ICLR, OpenReview.net

You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S., Yang, Y ., 2024. Ferret: Refer and ground anything anywhere at any granularity, in: ICLR, OpenReview.net

[50]

Modeling context in referring expressions

Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L., 2016. Modeling context in referring expressions, in: ECCV (2), Springer. pp. 69–85

[51]

MM-Vet: Evaluating large multimodal models for integrated capabilities

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L., 2024. MM-Vet: Evaluating large multimodal models for integrated capabilities, in: ICML, OpenReview.net

[52]

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

Yue, X., Ni, Y., Zheng, T., Zhang, K., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., Chen, W., 2024. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI, in: CVPR, IEEE. pp. 9556–9567

[53]

Sigmoid loss for language image pre-training

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L., 2023. Sigmoid loss for language image pre-training, in: ICCV, IEEE. pp. 11941–11952

[54]

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-Tuning

Zhang, H., Gao, M., Gan, Z., Dufter, P., Wenzel, N., Huang, F., Shah, D., Du, X., Zhang, B., Li, Y., Dodge, S., You, K., Yang, Z., Timofeev, A., Xu, M., Chen, H., Fauconnier, J., Lai, Z., You, H., Wang, Z., Dehghan, A., Grasch, P., Yang, Y., 2024. MM1.5: methods, analysis & insights from multimodal LLM fine-tuning. CoRR abs/2409.20566

[55]

MiniGPT-4: Enhancing vision-language understanding with advanced large language models

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M., 2024. MiniGPT-4: Enhancing vision-language understanding with advanced large language models, in: ICLR, OpenReview.net

