arxiv: 2503.02597 · v3 · pith:UFLH42CJnew · submitted 2025-03-04 · 💻 cs.CV · cs.AI

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Pith reviewed 2026-05-23 01:19 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal large language models · causal attention · vision-language misalignment · modality-mutual attention · multimodal understanding benchmarks · decoder-only architectures

The pith

Unlocking causal attention into modality-mutual attention lets image tokens attend to text tokens and raises multimodal LLM performance on twelve benchmarks by 6.2 percent on average with no added parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language misalignment arises because decoder-only LLMs use causal attention, which blocks image tokens from incorporating information carried by later text tokens. Replacing the causal mask with modality-mutual attention removes this restriction while keeping intra-modality causality intact. The resulting architecture is tested on three different LLM backbones and delivers state-of-the-art scores across twelve multimodal understanding benchmarks without introducing any new trainable parameters. The design is presented as generic enough to apply to other modality pairs and larger multimodal settings.

Core claim

The paper establishes that converting the standard causal attention mask into modality-mutual attention enables image tokens to attend directly to text tokens, thereby reducing vision-language misalignment and producing higher accuracy on multimodal tasks across multiple model backbones without any increase in parameter count.

What carries the argument

Modality-mutual attention (MMA), a modified attention mask that permits image tokens to attend to text tokens while preserving the original causal ordering within each modality.
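
The mask described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the mask is the standard causal mask plus one extra rule (every image query may see every text key), and the names `causal_mask`, `mma_mask`, `render`, and the `"I"`/`"T"` labels are hypothetical. How the paper treats generated response tokens is not stated here, so the sketch treats all text positions alike.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Standard causal mask: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def mma_mask(modalities: List[str]) -> List[List[bool]]:
    """Assumed modality-mutual mask (illustrative, not the paper's code).

    modalities[i] is "I" for an image token and "T" for a text token.
    Start from the causal mask, then let each image query attend to
    every text key, including text that appears later in the sequence.
    Image-to-image and text-to-anything entries stay causal.
    """
    n = len(modalities)
    mask = causal_mask(n)
    for i in range(n):
        if modalities[i] != "I":
            continue
        for j in range(n):
            if modalities[j] == "T":
                mask[i][j] = True
    return mask


def render(mask: List[List[bool]]) -> str:
    """Rows are queries, columns are keys; '1' = visible, '.' = masked."""
    return "\n".join("".join("1" if a else "." for a in row) for row in mask)


# Image-then-text order: two image tokens followed by two text tokens.
print(render(mma_mask(["I", "I", "T", "T"])))
# 1.11
# 1111
# 111.
# 1111
```

In the printed mask, the first image token still cannot see the second image token (intra-modality causality is kept), but both image tokens can see both text tokens. The text rows are identical to the plain causal mask.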

If this is right

  • MMA raises average performance by 6.2 percent across twelve multimodal understanding benchmarks on three different LLM backbones.
  • The change requires no additional parameters.
  • The same attention modification can be applied to other pairs of modalities.
  • The approach scales to a range of multimodal input scenarios without architectural redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the attention mask change is the decisive factor, comparable mask adjustments could be tested in models that process audio or video before text.
  • The reported gains may depend on the specific instruction-tuning data; repeating the experiments on held-out multimodal datasets would test robustness.
  • Extending the mutual attention pattern to three or more modalities at once remains an open implementation question left by the paper.

Load-bearing premise

Vision-language misalignment is caused mainly by image tokens being unable to see subsequent text tokens under causal attention.

What would settle it

An ablation that enables image-to-text attention yet shows no gain or a drop in factual alignment on the same twelve benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2503.02597 by Helen Suzuki, Wei-Yao Wang, Yoshiyuki Kobayashi, Zhao Wang.

Figure 1: An illustration of the vision-centric scenario. The image contains ambiguous signs with …
Figure 2: The conventional framework for MLLMs (e.g., Molmo [ …
Figure 3: An illustration for dual-order training, where T and I indicate text and images, respectively.
Figure 4: The prompt template for the I&T and T&I input orders. {image patch} and {question} are …
Figure 5: An illustration for our proposed modality-mutual attention (MMA), which modifies the …
Figure 6: Illustrations sampled from the Blip3-kale dataset.
Figure 7: Illustrations sampled from the Blip3-OCR dataset.
Figure 8: Qualitative comparisons among the conventional training pipeline ((I&T) …
Figure 9: Qualitative comparisons among the conventional training pipeline ((I&T) …
Figure 10: Qualitative comparisons among the conventional training pipeline ((I&T) …
Figure 11: Qualitative comparisons among the conventional training pipeline ((I&T) …

Original abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we propose an MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that vision-language misalignment in MLLMs arises primarily because decoder-only LLMs use causal attention, which prevents earlier image tokens from attending to later text tokens. It proposes modality-mutual attention (MMA) by unlocking the causal mask to enable bidirectional cross-modal attention. This change is reported to yield state-of-the-art results on 12 multimodal understanding benchmarks (+6.2% average across three LLM backbones) with no added parameters; the design is presented as generic and scalable to other modalities.

Significance. If the performance gains can be robustly attributed to MMA via isolating controls, the result would be significant: a parameter-free architectural modification to the core attention mechanism that improves factual alignment in MLLMs. The absence of extra parameters and the claim of applicability across backbones are clear strengths. The work would encourage re-examination of causal masking assumptions in multimodal decoder-only models.

major comments (2)
  1. [Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.
  2. [Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.
minor comments (1)
  1. [Abstract] The abstract states results on '12 multimodal understanding benchmarks' but does not enumerate them or the three LLM backbones; adding this list would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. The concerns about isolating the contribution of the attention-mask change and verifying language-only performance are valid for strengthening attribution. We respond to each point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.

    Authors: We agree that a more granular ablation isolating only the image-to-text direction would strengthen the causal attribution. Our reported results compare baseline causal-attention models against MMA versions under identical training data, optimizer, learning rate schedule, and all other hyperparameters, with the sole change being the attention mask that enables image tokens to attend to subsequent text tokens. To directly address the referee's request, we will add a dedicated ablation table in the revised manuscript that holds every factor fixed and toggles solely the image-to-text attention direction (while preserving text-to-image causality) to quantify its isolated contribution to the observed gains. revision: yes

  2. Referee: [Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.

    Authors: When the input sequence contains only text tokens, the MMA mask is identical to the original causal mask because no image tokens exist to create cross-modal interactions. Consequently, language-only behavior is unchanged by construction. Nevertheless, to make this explicit and address the referee's concern, we will include language-only benchmark results (e.g., on standard text-only suites) in the revised experimental section to empirically confirm that no degradation or new misalignment is introduced. revision: yes
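
The by-construction argument in the second response can be written out under one assumed formalization of the mask. The symbols $m_i$, $A^{\mathrm{causal}}$, and $A^{\mathrm{MMA}}$ are notation introduced here for illustration and are not taken from the paper.

```latex
% Assumed notation: m_i \in \{\mathrm{I}, \mathrm{T}\} marks token i as image or text;
% A_{ij} = 1 means query i may attend to key j.
A^{\mathrm{causal}}_{ij} = \mathbb{1}\big[\, j \le i \,\big],
\qquad
A^{\mathrm{MMA}}_{ij} = \mathbb{1}\big[\, j \le i \ \text{ or } \ (m_i = \mathrm{I} \text{ and } m_j = \mathrm{T}) \,\big].

% Text-only input: m_i = \mathrm{T} for every i, so the condition m_i = \mathrm{I}
% never holds and the second clause is always false. Hence
A^{\mathrm{MMA}}_{ij} = \mathbb{1}\big[\, j \le i \,\big] = A^{\mathrm{causal}}_{ij}
\quad \text{for all } i, j.
```

This shows only that the mask is unchanged on text-only input. Weights trained under MMA may still differ from those of the causal baseline, which is what the promised language-only benchmarks would measure.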

Circularity Check

0 steps flagged

No circularity: empirical benchmark gains rest on reported results, not self-referential equations or fits

Full rationale

The paper proposes replacing causal attention with modality-mutual attention (MMA) to allow image tokens to attend to text tokens, claiming this resolves vision-language misalignment and yields +6.2% average gains on 12 benchmarks across 3 backbones without added parameters. No equations, parameter fits, or derivations appear in the provided text. The central claim is an architectural modification validated by external benchmark numbers rather than any quantity that reduces to its own inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or renamed empirical patterns are invoked as load-bearing steps. The derivation chain is therefore self-contained against the reported empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only. The central claim rests on the effectiveness of the MMA design, which is introduced as a new mechanism without independent verification beyond the stated benchmark gains. No free parameters, axioms, or invented entities beyond the MMA mechanism itself are described.

invented entities (1)
  • modality-mutual attention (MMA) · no independent evidence
    purpose: To allow image tokens to attend to text tokens by modifying causal attention
    New attention design proposed to address misalignment; no independent evidence provided outside the paper's claims.

pith-pipeline@v0.9.0 · 5777 in / 1115 out tokens · 52660 ms · 2026-05-23T01:19:48.007663+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RAVE: Re-Allocating Visual Attention in Large Multimodal Models

    cs.CV 2026-05 unverdicted novelty 5.0

    RAVE is a lightweight pair-gating mechanism that adds a learned bias to pre-softmax attention over visual keys in LMMs, yielding an average 3-point gain on multimodal benchmarks with larger improvements on perception tasks.

  2. Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow

    cs.CV 2026-04 unverdicted novelty 5.0

    An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
