pith. machine review for the scientific record.

arxiv: 2604.04579 · v2 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Bo Zhao, Debesh Jha, Mustapha Abdullahi, Quoc-Huy Trinh

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · efficient decoding · state-space models · cross-modality modulation · visual grounding · linear-time inference · fine-grained understanding

The pith

Replacing the Transformer decoder with a Liquid Foundation Model (LFM) and adding a Token-Grid Correlation Module lets vision-language models handle fine-grained tasks at linear computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that swapping the quadratic Transformer decoder for a Liquid Foundation Model decoder, together with a lightweight Token-Grid Correlation Module, produces vision-language models that remain accurate on detailed visual reasoning while running far more efficiently. A sympathetic reader would care because current multimodal models are too expensive to run on phones, cameras, or document tools, restricting their everyday use. The new design keeps inference linear by computing simple correlations between text tokens and image patches and then modulating those features through a state-space model with FiLM conditioning.
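To make the mechanism concrete, here is a minimal PyTorch sketch of what a token-grid correlation module with FiLM modulation could look like, reconstructed from the abstract's description alone. The class name, tensor shapes, and the pooled-query form of the correlation are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class TokenGridCorrelation(nn.Module):
    """Hypothetical sketch of a token-grid correlation module with FiLM
    modulation, reconstructed from the abstract; not the authors' code."""

    def __init__(self, d_text: int, d_vis: int, d_model: int):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)   # project text tokens
        self.vis_proj = nn.Linear(d_vis, d_model)     # project image patches
        # FiLM generator: maps a correlation summary to per-channel scale/shift
        self.film = nn.Linear(d_model, 2 * d_model)

    def forward(self, text: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # text: (B, T, d_text) token embeddings; grid: (B, P, d_vis) patch features
        t = self.text_proj(text)                      # (B, T, d)
        v = self.vis_proj(grid)                       # (B, P, d)
        # Lightweight correlation: score each patch against a pooled text
        # query instead of materializing a full T x P attention map.
        query = t.mean(dim=1, keepdim=True)           # (B, 1, d)
        scores = (v * query).sum(-1, keepdim=True)    # (B, P, 1) dot products
        weights = torch.sigmoid(scores)               # per-patch relevance in [0, 1]
        pooled = (weights * v).sum(dim=1)             # (B, d) relevance-weighted summary
        gamma, beta = self.film(pooled).chunk(2, -1)  # FiLM scale and shift
        # Channel-wise modulation of the patch features; a state-space decoder
        # would consume the modulated sequence downstream.
        return gamma.unsqueeze(1) * v + beta.unsqueeze(1)

# Toy shapes: batch of 2, 16 text tokens, 196 patches
mod = TokenGridCorrelation(d_text=512, d_vis=768, d_model=512)
out = mod(torch.randn(2, 16, 512), torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 512])

The point of the sketch is the scaling property: every step is linear in the number of text tokens and image patches, and nothing materializes the quadratic token-by-patch attention matrix.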

Core claim

Firebolt-VL achieves accurate, fine-grained understanding by replacing the Transformer-based decoder with a Liquid Foundation Model decoder and introducing a Token-Grid Correlation Module that computes lightweight correlations between text tokens and image patches. The module then modulates the result through the state-space model with FiLM conditioning, emphasizing task-relevant visual regions while preserving linear-time inference.

What carries the argument

The Token-Grid Correlation Module, which computes correlations between text tokens and image patches and modulates them through the state-space model with FiLM conditioning to enable selective visual grounding.

If this is right

  • Inference cost scales linearly with sequence length instead of quadratically, enabling longer prompts or higher-resolution images (a back-of-the-envelope comparison follows this list).
  • Smaller vision-language models can now maintain performance on fine-grained reasoning without the overhead of full cross-attention.
  • Deployment becomes feasible in resource-limited settings such as personal assistants, document readers, and edge cameras.
  • The same modulation approach can be applied to other state-space or linear-time decoders without redesigning the entire architecture.
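To put numbers behind the first bullet, here is a per-layer cost comparison over a fused sequence of $N$ tokens with hidden width $d$, using standard complexity facts about attention and state-space recurrences rather than figures from the paper:

\[
\underbrace{\mathcal{O}(N^{2} d)}_{\text{self/cross-attention scores}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}(N d\, s)}_{\text{state-space recurrence, state size } s}
\]

Ignoring the $\mathcal{O}(N d^{2})$ projection terms common to both, doubling $N$ (a longer prompt or a finer patch grid) roughly quadruples the attention cost but only doubles the recurrence cost.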

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • State-space models appear viable as drop-in replacements for attention in multimodal fusion, opening a route to larger context windows.
  • The correlation module could be tested on video or multi-image inputs where quadratic attention quickly becomes prohibitive.
  • Combining the module with other efficient vision backbones might push the accuracy-efficiency frontier further than either technique alone.

Load-bearing premise

The LFM decoder and Token-Grid Correlation Module together deliver the claimed accuracy and efficiency gains on fine-grained tasks without hidden costs or extra tuning that would appear in real deployment.

What would settle it

A side-by-side test on a fine-grained benchmark such as GQA or VQA-v2 that measures both task accuracy and wall-clock inference time against a comparable Transformer decoder; if accuracy falls or runtime shows no clear linear advantage, the claim is falsified.
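A minimal sketch of such a harness, assuming both models expose a Hugging-Face-style generate method; the model labels, dataset iterator, and scoring function are placeholders rather than anything specified by the paper:

import time
import torch

@torch.no_grad()
def timed_generate(model, inputs, max_new_tokens=32):
    """Wall-clock one generation call; returns (output_ids, seconds)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # flush queued kernels before timing
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

def compare(models, dataset, score_fn):
    """Side-by-side accuracy and mean latency on a shared benchmark split.

    models: label -> model; dataset: label -> iterable of (inputs, answer)
    pairs tokenized for that model; score_fn judges a prediction against
    the reference (e.g., VQA accuracy). All placeholder interfaces.
    """
    for name, model in models.items():
        correct, total, seconds = 0, 0, 0.0
        for inputs, answer in dataset[name]:
            out, dt = timed_generate(model, inputs)
            correct += score_fn(out, answer)
            total += 1
            seconds += dt
        print(f"{name}: acc={correct / total:.3f}, "
              f"mean latency={1e3 * seconds / total:.1f} ms")

Sweeping the prompt length or patch count and plotting mean latency against it would show whether the measured curve is closer to linear or quadratic, which is the crux of the falsification test above.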

Figures

Figures reproduced from arXiv: 2604.04579 by Bo Zhao, Debesh Jha, Mustapha Abdullahi, Quoc-Huy Trinh.

Figure 1. Performance and latency efficiency of Firebolt-VL.
Figure 2. Overview of the Firebolt-VL architecture; the Cross-Modal Modulator (CMM) fuses textual instructions with the visual representation.
Figure 4. Accuracy–latency comparison of compact MLLMs.
original abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. It further proposes a Token-Grid Correlation Module to compute lightweight correlations between text tokens and image patches, with modulation performed via a state-space model using FiLM conditioning. The design targets linear-time inference while improving visual grounding for fine-grained reasoning tasks. Experimental results on multiple benchmarks are claimed to show accurate fine-grained understanding with significantly improved efficiency over existing MLLMs.

Significance. If the accuracy-efficiency tradeoff is validated, the approach could meaningfully advance deployable VL models for resource-constrained applications by replacing quadratic cross-attention with linear-complexity alternatives while preserving fine-grained performance. The open release of model and code supports reproducibility and follow-on work.

minor comments (3)
  1. [Abstract] The claims of 'significantly improved efficiency' and 'accurate, fine-grained understanding' would be strengthened by including at least one or two key quantitative metrics (e.g., accuracy delta and FLOPs/latency reduction versus a named baseline) rather than leaving the results entirely qualitative.
  2. [§3.2, Token-Grid Correlation Module] The linear-time inference claim would benefit from an explicit big-O derivation or a table comparing complexity to standard cross-attention; currently the description is high-level and relies on state-space model properties without a self-contained proof sketch.
  3. [§4, Experiments] While benchmarks are mentioned, the manuscript should ensure ablation tables explicitly isolate the LFM decoder contribution from the Token-Grid module and report variance across multiple runs to support the fine-grained reasoning claims.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The provided summary accurately reflects the core contributions of Firebolt-VL, including the replacement of the Transformer decoder with an LFM-based decoder and the introduction of the Token-Grid Correlation Module for linear-complexity cross-modality modulation.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Firebolt-VL via architectural choices (LFM decoder replacing Transformer, Token-Grid Correlation Module with state-space model and FiLM) motivated by efficiency and fine-grained grounding needs. No equations, derivations, predictions, or fitted parameters are shown that reduce to inputs by construction. Claims rest on empirical benchmark results rather than self-referential math or self-citation chains. The argument is self-contained as a design proposal evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are described in the abstract; all details are deferred to the unreviewed full manuscript.

pith-pipeline@v0.9.0 · 5513 in / 1008 out tokens · 19776 ms · 2026-05-10T18:57:46.499472+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    The CARM module boosts neural routing solvers by adaptively modulating embeddings with constraint variables, enabling better use of global observations and improved performance on constrained VRPs.

Reference graph

Works this paper leans on

48 extracted references · 21 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    Liquid Foundation Models: Our First Series of Generative AI Models

    Liquid AI. Liquid Foundation Models: Our First Series of Generative AI Models, 2024.

  2. [2]

    VQA: Visual Question Answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.

  3. [3]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  5. [5]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.

  7. [7]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.

  8. [8]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.

  9. [9]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

  10. [10]

    MobileVLM V2: Faster and stronger baseline for vision language model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

  11. [11]

    Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

  12. [12]

    Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement

    Qianhan Feng, Wenshuo Li, Tong Lin, and Xinghao Chen. Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4178–4188, 2025.

  13. [13]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models, 2025.

  14. [14]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.

  15. [15]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, 2024.

  16. [16]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.

  17. [17]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.

  18. [18]

    State-space models

    James D. Hamilton. State-space models. Handbook of Econometrics, 4:3039–3080, 1994.

  19. [19]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.

  20. [20]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Proceedings of the European Conference on Computer Vision, pages 235–251, 2016.

  21. [21]

    OBELICS: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36:71683–71702, 2023.

  22. [22]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.

  23. [23]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742, 2023.

  24. [24]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

  25. [25]

    MoE-LLaVA: Mixture of experts for large vision-language models

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  28. [28]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  29. [29]

    MMBench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, pages 216–233, 2024.

  30. [30]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

  31. [31]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.

  32. [32]

    Understanding guided image captioning performance across domains

    Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339, 2020.

  33. [33]

    Hierarchical visual feature aggregation for OCR-free document understanding

    Jaeyoo Park, Jin Young Choi, Jeonghyung Park, and Bohyung Han. Hierarchical visual feature aggregation for OCR-free document understanding. arXiv preprint arXiv:2411.05254, 2024.

  34. [34]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.

  35. [35]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  36. [36]

    Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

  37. [37]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  40. [40]

    Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

    Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.

  41. [41]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  42. [42]

    MobileVLM: A vision-language model for better intra- and inter-UI understanding

    Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. MobileVLM: A vision-language model for better intra- and inter-UI understanding. arXiv preprint arXiv:2409.14818, 2024.

  43. [43]

    LLaVA-CoT: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step, 2024.

  44. [44]

    Dense connector for MLLMs

    Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for MLLMs. Advances in Neural Information Processing Systems, 37:33108–33140, 2024.

  45. [45]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

  46. [46]

    AlignGPT: Multi-modal large language models with adaptive alignment capability

    Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, and Xinyu Dai. AlignGPT: Multi-modal large language models with adaptive alignment capability. arXiv preprint arXiv:2405.14129, 2024.

  47. [47]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  48. [48]

    LLaVA-Phi: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. LLaVA-Phi: Efficient multi-modal assistant with small language model. In Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, pages 18–22, 2024.