pith. machine review for the scientific record.

arxiv: 2604.04579 · v2 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

Bo Zhao, Debesh Jha, Mustapha Abdullahi, Quoc-Huy Trinh

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · efficient decoding · state-space models · cross-modality modulation · visual grounding · linear-time inference · fine-grained understanding

The pith

Replacing the Transformer decoder with a Liquid Foundation Model (LFM) and adding a Token-Grid Correlation Module lets vision-language models handle fine-grained tasks at linear computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to prove that swapping the quadratic Transformer decoder for a Liquid Foundation Model decoder, together with a lightweight Token-Grid Correlation Module, produces vision-language models that remain accurate on detailed visual reasoning while running far more efficiently. A sympathetic reader would care because current multimodal models are too expensive to run on phones, cameras, or document tools, restricting their everyday use. The new design keeps inference linear by computing simple correlations between text tokens and image patches and then modulating those features through a state-space model with FiLM conditioning.
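To make the mechanism concrete, here is a minimal PyTorch sketch of what a token-grid correlation module with FiLM modulation could look like, reconstructed from the abstract's description alone. The class name, tensor shapes, and the pooled-query form of the correlation are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

class TokenGridCorrelation(nn.Module):
    """Hypothetical sketch of a token-grid correlation module with FiLM
    modulation, reconstructed from the abstract; not the authors' code."""

    def __init__(self, d_text: int, d_vis: int, d_model: int):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)   # project text tokens
        self.vis_proj = nn.Linear(d_vis, d_model)     # project image patches
        # FiLM generator: maps a correlation summary to per-channel scale/shift
        self.film = nn.Linear(d_model, 2 * d_model)

    def forward(self, text: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
        # text: (B, T, d_text) token embeddings; grid: (B, P, d_vis) patch features
        t = self.text_proj(text)                      # (B, T, d)
        v = self.vis_proj(grid)                       # (B, P, d)
        # Lightweight correlation: score each patch against a pooled text
        # query instead of materializing a full T x P attention map.
        query = t.mean(dim=1, keepdim=True)           # (B, 1, d)
        scores = (v * query).sum(-1, keepdim=True)    # (B, P, 1) dot products
        weights = torch.sigmoid(scores)               # per-patch relevance in [0, 1]
        pooled = (weights * v).sum(dim=1)             # (B, d) relevance-weighted summary
        gamma, beta = self.film(pooled).chunk(2, -1)  # FiLM scale and shift
        # Channel-wise modulation of the patch features; a state-space decoder
        # would consume the modulated sequence downstream.
        return gamma.unsqueeze(1) * v + beta.unsqueeze(1)

# Toy shapes: batch of 2, 16 text tokens, 196 patches
mod = TokenGridCorrelation(d_text=512, d_vis=768, d_model=512)
out = mod(torch.randn(2, 16, 512), torch.randn(2, 196, 768))
print(out.shape)  # torch.Size([2, 196, 512])

The point of the sketch is the scaling property: every step is linear in the number of text tokens and image patches, and nothing materializes the quadratic token-by-patch attention matrix.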

Core claim

Firebolt-VL achieves accurate, fine-grained understanding by replacing the Transformer-based decoder with a Liquid Foundation Model decoder and introducing a Token-Grid Correlation Module that computes lightweight correlations between text tokens and image patches. The module then modulates the result through the state-space model with FiLM conditioning, emphasizing task-relevant visual regions while preserving linear-time inference.

What carries the argument

The Token-Grid Correlation Module, which computes correlations between text tokens and image patches and modulates them through the state-space model with FiLM conditioning to enable selective visual grounding.

If this is right

  • Inference cost scales linearly with sequence length instead of quadratically, enabling longer prompts or higher-resolution images (a back-of-the-envelope comparison follows this list).
  • Smaller vision-language models can now maintain performance on fine-grained reasoning without the overhead of full cross-attention.
  • Deployment becomes feasible in resource-limited settings such as personal assistants, document readers, and edge cameras.
  • The same modulation approach can be applied to other state-space or linear-time decoders without redesigning the entire architecture.
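To put numbers behind the first bullet, here is a per-layer cost comparison over a fused sequence of $N$ tokens with hidden width $d$, using standard complexity facts about attention and state-space recurrences rather than figures from the paper:

\[
\underbrace{\mathcal{O}(N^{2} d)}_{\text{self/cross-attention scores}}
\qquad \text{vs.} \qquad
\underbrace{\mathcal{O}(N d\, s)}_{\text{state-space recurrence, state size } s}
\]

Ignoring the $\mathcal{O}(N d^{2})$ projection terms common to both, doubling $N$ (a longer prompt or a finer patch grid) roughly quadruples the attention cost but only doubles the recurrence cost.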

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • State-space models appear viable as drop-in replacements for attention in multimodal fusion, opening a route to larger context windows.
  • The correlation module could be tested on video or multi-image inputs where quadratic attention quickly becomes prohibitive.
  • Combining the module with other efficient vision backbones might push the accuracy-efficiency frontier further than either technique alone.

Load-bearing premise

The LFM decoder and Token-Grid Correlation Module together deliver the claimed accuracy and efficiency gains on fine-grained tasks without hidden costs or extra tuning that would appear in real deployment.

What would settle it

A side-by-side test on a fine-grained benchmark such as GQA or VQA-v2 that measures both task accuracy and wall-clock inference time against a comparable Transformer decoder; if accuracy falls or runtime shows no clear linear advantage, the claim is falsified.
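A minimal sketch of such a harness, assuming both models expose a Hugging-Face-style generate method; the model labels, dataset iterator, and scoring function are placeholders rather than anything specified by the paper:

import time
import torch

@torch.no_grad()
def timed_generate(model, inputs, max_new_tokens=32):
    """Wall-clock one generation call; returns (output_ids, seconds)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # flush queued kernels before timing
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

def compare(models, dataset, score_fn):
    """Side-by-side accuracy and mean latency on a shared benchmark split.

    models: label -> model; dataset: label -> iterable of (inputs, answer)
    pairs tokenized for that model; score_fn judges a prediction against
    the reference (e.g., VQA accuracy). All placeholder interfaces.
    """
    for name, model in models.items():
        correct, total, seconds = 0, 0, 0.0
        for inputs, answer in dataset[name]:
            out, dt = timed_generate(model, inputs)
            correct += score_fn(out, answer)
            total += 1
            seconds += dt
        print(f"{name}: acc={correct / total:.3f}, "
              f"mean latency={1e3 * seconds / total:.1f} ms")

Sweeping the prompt length or patch count and plotting mean latency against it would show whether the measured curve is closer to linear or quadratic, which is the crux of the falsification test above.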

Figures

Figures reproduced from arXiv: 2604.04579 by Bo Zhao, Debesh Jha, Mustapha Abdullahi, Quoc-Huy Trinh.

Figure 1. Performance and latency efficiency of Firebolt-VL.
Figure 2. Overview of the Firebolt-VL architecture; the Cross-Modal Modulator (CMM) fuses textual instructions with the visual representation.
Figure 4. Accuracy–latency comparison of compact MLLMs.
original abstract

Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. It further proposes a Token-Grid Correlation Module to compute lightweight correlations between text tokens and image patches, with modulation performed via a state-space model using FiLM conditioning. The design targets linear-time inference while improving visual grounding for fine-grained reasoning tasks. Experimental results on multiple benchmarks are claimed to show accurate fine-grained understanding with significantly improved efficiency over existing MLLMs.

Significance. If the accuracy-efficiency tradeoff is validated, the approach could meaningfully advance deployable VL models for resource-constrained applications by replacing quadratic cross-attention with linear-complexity alternatives while preserving fine-grained performance. The open release of model and code supports reproducibility and follow-on work.

minor comments (3)
  1. [Abstract] The claims of 'significantly improved efficiency' and 'accurate, fine-grained understanding' would be strengthened by including at least one or two key quantitative metrics (e.g., accuracy delta and FLOPs/latency reduction versus a named baseline) rather than leaving the results entirely qualitative.
  2. [§3.2, Token-Grid Correlation Module] The linear-time inference claim would benefit from an explicit big-O derivation or a table comparing complexity to standard cross-attention; currently the description is high-level and relies on state-space model properties without a self-contained proof sketch.
  3. [§4, Experiments] While benchmarks are mentioned, the manuscript should ensure ablation tables explicitly isolate the LFM decoder contribution from the Token-Grid module and report variance across multiple runs to support the fine-grained reasoning claims.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The provided summary accurately reflects the core contributions of Firebolt-VL, including the replacement of the Transformer decoder with an LFM-based decoder and the introduction of the Token-Grid Correlation Module for linear-complexity cross-modality modulation.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces Firebolt-VL via architectural choices (LFM decoder replacing Transformer, Token-Grid Correlation Module with state-space model and FiLM) motivated by efficiency and fine-grained grounding needs. No equations, derivations, predictions, or fitted parameters are shown that reduce to inputs by construction. Claims rest on empirical benchmark results rather than self-referential math or self-citation chains. The argument is self-contained as a design proposal evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or invented entities are described in the abstract; all details are deferred to the unreviewed full manuscript.

pith-pipeline@v0.9.0 · 5513 in / 1008 out tokens · 19776 ms · 2026-05-10T18:57:46.499472+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    The CARM module boosts neural routing solvers by adaptively modulating embeddings with constraint variables, enabling better use of global observations and improved performance on constrained VRPs.

Reference graph

Works this paper leans on

48 extracted references · 21 canonical work pages · cited by 1 Pith paper · 13 internal anchors

  1. [1]

    Liquid Foundation Models: Our First Series of Generative AI Models

    Liquid AI. Liquid Foundation Models: Our First Series of Generative AI Models, 2024.

  2. [2]

    VQA: Visual Question Answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.

  3. [3]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

  4. [4]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

  5. [5]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.

  7. [7]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.

  8. [8]

    How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.

  9. [9]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.

  10. [10]

    MobileVLM V2: Faster and stronger baseline for vision language model

    Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.

  11. [11]

    Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

  12. [12]

    Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement

    Qianhan Feng, Wenshuo Li, Tong Lin, and Xinghao Chen. Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4178–4188, 2025.

  13. [13]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models, 2025.

  14. [14]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.

  15. [15]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, 2024.

  16. [16]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.

  17. [17]

    On the parameterization and initialization of diagonal state space models

    Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.

  18. [18]

    State-space models

    James D. Hamilton. State-space models. Handbook of Econometrics, 4:3039–3080, 1994.

  19. [19]

    Liquid structural state-space models

    Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.

  20. [20]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Proceedings of the European Conference on Computer Vision, pages 235–251, 2016.

  21. [21]

    OBELICS: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36:71683–71702, 2023.

  22. [22]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.

  23. [23]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742, 2023.

  24. [24]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.

  25. [25]

    MoE-LLaVA: Mixture of experts for large vision-language models

    Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.

  26. [26]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  27. [27]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  28. [28]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.

  29. [29]

    MMBench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, pages 216–233, 2024.

  30. [30]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

  31. [31]

    SmolVLM: Redefining small and efficient multimodal models

    Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.

  32. [32]

    Understanding guided image captioning performance across domains

    Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339, 2020.

  33. [33]

    Hierarchical visual feature aggregation for OCR-free document understanding

    Jaeyoo Park, Jin Young Choi, Jeonghyung Park, and Bohyung Han. Hierarchical visual feature aggregation for OCR-free document understanding. arXiv preprint arXiv:2411.05254, 2024.

  34. [34]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.

  35. [35]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

  36. [36]

    Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.

  37. [37]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

  38. [38]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  40. [40]

    Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

    Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.

  41. [41]

    DeepSeek-OCR: Contexts Optical Compression

    Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  42. [42]

    MobileVLM: A vision-language model for better intra- and inter-UI understanding

    Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. MobileVLM: A vision-language model for better intra- and inter-UI understanding. arXiv preprint arXiv:2409.14818, 2024.

  43. [43]

    LLaVA-CoT: Let vision language models reason step-by-step

    Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step, 2024.

  44. [44]

    Dense connector for MLLMs

    Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for MLLMs. Advances in Neural Information Processing Systems, 37:33108–33140, 2024.

  45. [45]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

  46. [46]

    AlignGPT: Multi-modal large language models with adaptive alignment capability

    Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, and Xinyu Dai. AlignGPT: Multi-modal large language models with adaptive alignment capability. arXiv preprint arXiv:2405.14129, 2024.

  47. [47]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

  48. [48]

    LLaVA-Phi: Efficient multi-modal assistant with small language model

    Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. LLaVA-Phi: Efficient multi-modal assistant with small language model. In Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited Resources, pages 18–22, 2024.