Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
Pith reviewed 2026-05-23 01:19 UTC · model grok-4.3
The pith
Unlocking causal attention into modality-mutual attention lets image tokens attend to text tokens and raises multimodal LLM performance on twelve benchmarks by 6.2 percent on average with no added parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that converting the standard causal attention mask into modality-mutual attention enables image tokens to attend directly to text tokens, thereby reducing vision-language misalignment and producing higher accuracy on multimodal tasks across multiple model backbones without any increase in parameter count.
What carries the argument
Modality-mutual attention (MMA), a modified attention mask that permits image tokens to attend to text tokens while preserving the original causal ordering within each modality.
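The mask described above can be sketched in a few lines. This is an illustration built only from the one-sentence description here, not the authors' implementation: the names `causal_mask` and `modality_mutual_mask` and the boolean `is_image` layout are assumptions, and the sketch exposes every text token to image tokens, whereas the summary does not say which text tokens (for example, prompt only) are actually exposed.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Standard causal mask: query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def modality_mutual_mask(is_image: List[bool]) -> List[List[bool]]:
    """Causal mask plus image-query -> text-key entries (assumed MMA pattern).

    is_image[k] is True when position k holds an image token.
    Image-to-image and text-to-anything entries stay causal.
    """
    n = len(is_image)
    mask = causal_mask(n)
    for i in range(n):
        if not is_image[i]:
            continue  # text queries keep their causal row unchanged
        for j in range(n):
            if not is_image[j]:
                mask[i][j] = True  # image query may see this text key
    return mask


# Toy sequence: two image tokens followed by two text tokens.
for row in modality_mutual_mask([True, True, False, False]):
    print("".join("1" if allowed else "0" for allowed in row))
# 1011
# 1111
# 1110
# 1111
```

In the first printed row, the first image token sees both text tokens but still not the second image token, which is what "preserving the original causal ordering within each modality" amounts to in this sketch.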
If this is right
- MMA raises average performance by 6.2 percent across twelve multimodal understanding benchmarks on three different LLM backbones.
- The change requires no additional parameters.
- The same attention modification can be applied to other pairs of modalities.
- The approach scales to a range of multimodal input scenarios without architectural redesign.
Where Pith is reading between the lines
- If the attention mask change is the decisive factor, comparable mask adjustments could be tested in models that process audio or video before text.
- The reported gains may depend on the specific instruction-tuning data; repeating the experiments on held-out multimodal datasets would test robustness.
- Extending the mutual attention pattern to three or more modalities at once remains an open implementation question left by the paper.
Load-bearing premise
Vision-language misalignment is caused mainly by image tokens being unable to see subsequent text tokens under causal attention.
What would settle it
An ablation that enables image-to-text attention yet shows no gain or a drop in factual alignment on the same twelve benchmarks would falsify the central claim.
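One precondition for such an ablation is that the two arms differ only in the intended mask entries. A minimal sketch of that check follows, assuming a hypothetical `build_mask` helper with an `unlock_image_to_text` flag; neither name comes from the paper, and the mask rule is the one inferred from the description on this page.

```python
from typing import List, Tuple


def build_mask(is_image: List[bool], unlock_image_to_text: bool) -> List[List[bool]]:
    """Causal mask, optionally letting image queries attend to text keys."""
    n = len(is_image)
    return [
        [
            j <= i or (unlock_image_to_text and is_image[i] and not is_image[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def changed_entries(is_image: List[bool]) -> List[Tuple[int, int]]:
    """(query, key) positions where the two ablation arms disagree."""
    baseline = build_mask(is_image, unlock_image_to_text=False)
    unlocked = build_mask(is_image, unlock_image_to_text=True)
    n = len(is_image)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if baseline[i][j] != unlocked[i][j]
    ]


layout = [True, True, True, False, False]  # three image tokens, two text tokens
diff = changed_entries(layout)
# Every changed entry is an image query looking at a later text key.
assert all(layout[i] and not layout[j] and j > i for i, j in diff)
print(diff)
```

If the list contained any other kind of entry, a gain or drop between the two arms could not be attributed to image-to-text attention alone.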
Original abstract
Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are typically built on decoder-only LLMs consisting of a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the latter modalities (e.g., text). To address this problem, we propose an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that vision-language misalignment in MLLMs arises primarily because decoder-only LLMs use causal attention, which prevents earlier image tokens from attending to later text tokens. It proposes modality-mutual attention (MMA) by unlocking the causal mask to enable bidirectional cross-modal attention. This change is reported to yield state-of-the-art results on 12 multimodal understanding benchmarks (+6.2% average across three LLM backbones) with no added parameters; the design is presented as generic and scalable to other modalities.
Significance. If the performance gains can be robustly attributed to MMA via isolating controls, the result would be significant: a parameter-free architectural modification to the core attention mechanism that improves factual alignment in MLLMs. The absence of extra parameters and the claim of applicability across backbones are clear strengths. The work would encourage re-examination of causal masking assumptions in multimodal decoder-only models.
major comments (2)
- [Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.
- [Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.
minor comments (1)
- [Abstract] The abstract states results on '12 multimodal understanding benchmarks' but does not enumerate them or the three LLM backbones; adding this list would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. The concerns about isolating the contribution of the attention-mask change and verifying language-only performance are valid for strengthening attribution. We respond to each point below.
-
Referee: [Abstract] Abstract: the central claim that MMA resolves misalignment caused by causal attention and produces the reported gains rests on benchmark numbers, yet the abstract (and by extension the manuscript) provides no ablation that holds training data, optimizer, and all other factors fixed while toggling only the image-to-text attention direction. Without this control, attribution of the +6.2% average to the proposed mechanism cannot be verified.
Authors: We agree that a more granular ablation isolating only the image-to-text direction would strengthen the causal attribution. Our reported results compare baseline causal-attention models against MMA versions under identical training data, optimizer, learning rate schedule, and all other hyperparameters, with the sole change being the attention mask that enables image tokens to attend to subsequent text tokens. To directly address the referee's request, we will add a dedicated ablation table in the revised manuscript that holds every factor fixed and toggles solely the image-to-text attention direction (while preserving text-to-image causality) to quantify its isolated contribution to the observed gains. revision: yes
-
Referee: [Abstract] Abstract / Experimental results: no evaluation is described that checks whether pure-language modeling performance remains intact after the attention-mask change. This is load-bearing for the claim that MMA introduces no new misalignment or degradation.
Authors: When the input sequence contains only text tokens, the MMA mask is identical to the original causal mask because no image tokens exist to create cross-modal interactions. Consequently, language-only behavior is unchanged by construction. Nevertheless, to make this explicit and address the referee's concern, we will include language-only benchmark results (e.g., on standard text-only suites) in the revised experimental section to empirically confirm that no degradation or new misalignment is introduced. revision: yes
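The "unchanged by construction" argument in the second response can be checked mechanically at the mask level. The sketch below assumes the same inferred mask rule (causal, plus image queries attending to text keys) and uses hypothetical helper names; it says nothing about whether weights trained under the modified mask still perform as well on text-only benchmarks, which remains the empirical question the authors promise to address.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """Query i may attend to key j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def mma_mask(is_image: List[bool]) -> List[List[bool]]:
    """Assumed MMA rule: causal, plus image queries may attend to text keys."""
    n = len(is_image)
    return [
        [j <= i or (is_image[i] and not is_image[j]) for j in range(n)]
        for i in range(n)
    ]


# With no image tokens there is no image query row to unlock,
# so the mask collapses to the plain causal mask for every length.
for n in range(1, 6):
    assert mma_mask([False] * n) == causal_mask(n)

# A mixed sequence, by contrast, does differ from the causal mask.
assert mma_mask([True, False]) != causal_mask(2)
print("text-only masks match the causal mask")
```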
Circularity Check
No circularity: empirical benchmark gains rest on reported results, not self-referential equations or fits
The paper proposes replacing causal attention with modality-mutual attention (MMA) to allow image tokens to attend to text tokens, claiming this resolves vision-language misalignment and yields +6.2% average gains on 12 benchmarks across 3 backbones without added parameters. No equations, parameter fits, or derivations appear in the provided text. The central claim is an architectural modification validated by external benchmark numbers rather than any quantity that reduces to its own inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or renamed empirical patterns are invoked as load-bearing steps. The derivation chain is therefore self-contained against the reported empirical outcomes.
Axiom & Free-Parameter Ledger
invented entities (1)
- modality-mutual attention (MMA): no independent evidence
Forward citations
Cited by 2 Pith papers
-
RAVE: Re-Allocating Visual Attention in Large Multimodal Models
RAVE is a lightweight pair-gating mechanism that adds a learned bias to pre-softmax attention over visual keys in LMMs, yielding an average 3-point gain on multimodal benchmarks with larger improvements on perception tasks.
-
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.