Recognition: 2 Lean theorem links
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation
Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3
The pith
Replacing the Transformer decoder with an LFM and adding a Token-Grid Correlation Module lets vision-language models perform fine-grained tasks at linear computational cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Firebolt-VL achieves accurate, fine-grained understanding by replacing the Transformer-based decoder with a Liquid Foundation Model decoder and introducing a Token-Grid Correlation Module. The module computes lightweight correlations between text tokens and image patches, then modulates the result via the state-space model with FiLM conditioning, emphasizing task-relevant visual regions while preserving linear-time inference.
What carries the argument
The Token-Grid Correlation Module, which computes correlations between text tokens and image patches and modulates them through the state-space model with FiLM conditioning to enable selective visual grounding.
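The abstract describes this module only at a high level, so the following is a minimal numpy sketch of the general pattern it names (a single token-patch similarity map, softmax pooling, then a FiLM scale-and-shift). All shapes and the `W_gamma`/`W_beta` projections are illustrative stand-ins for learned parameters, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)
T, P, d = 4, 9, 16  # text tokens, image patches, feature dim

text = rng.standard_normal((T, d))   # text token embeddings
patch = rng.standard_normal((P, d))  # image patch embeddings

# Lightweight token-grid correlation: one T x P similarity map,
# costing O(T*P*d) instead of full multi-head cross-attention.
corr = text @ patch.T / np.sqrt(d)             # (T, P)
weights = np.exp(corr - corr.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)  # softmax over patches

# Pool patch features per text token with the correlation map.
pooled = weights @ patch                       # (T, d)

# FiLM conditioning: predict a per-feature scale (gamma) and shift
# (beta) from the pooled visual evidence, then modulate the text
# stream before it enters the decoder. W_gamma / W_beta are
# hypothetical projection matrices standing in for learned weights.
W_gamma = rng.standard_normal((d, d)) * 0.1
W_beta = rng.standard_normal((d, d)) * 0.1
gamma, beta = pooled @ W_gamma, pooled @ W_beta
modulated = (1.0 + gamma) * text + beta        # (T, d)
```

Tokens whose correlation map concentrates on a few patches get a visual-evidence-driven rescaling of exactly those features, which is one plausible mechanism for the "selective visual grounding" the review describes.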
If this is right
- Inference cost scales linearly with sequence length instead of quadratically, enabling longer prompts or higher-resolution images.
- Smaller vision-language models can now maintain performance on fine-grained reasoning without the overhead of full cross-attention.
- Deployment becomes feasible in resource-limited settings such as personal assistants, document readers, and edge cameras.
- The same modulation approach can be applied to other state-space or linear-time decoders without redesigning the entire architecture.
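The scaling claim in the first bullet can be made concrete with a back-of-envelope FLOP count; the state size `n = 16` below is a typical SSM choice, not a number from the paper:

```python
def cross_attention_flops(T, d):
    # QK^T and AV each touch every pair of positions: O(T^2 * d).
    return 2 * T * T * d

def ssm_flops(T, d, n=16):
    # One recurrence step per position, d channels, state size n: O(T * d * n).
    return T * d * n

d = 1024
for T in (1_000, 2_000, 4_000):
    # Attention's cost relative to the SSM grows linearly with T.
    print(T, cross_attention_flops(T, d) // ssm_flops(T, d))
    # prints: 1000 125, then 2000 250, then 4000 500
```

Doubling the sequence length doubles attention's relative cost, which is what makes longer prompts and higher-resolution images cheap for the linear decoder.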
Where Pith is reading between the lines
- State-space models appear viable as drop-in replacements for attention in multimodal fusion, opening a route to larger context windows.
- The correlation module could be tested on video or multi-image inputs where quadratic attention quickly becomes prohibitive.
- Combining the module with other efficient vision backbones might push the accuracy-efficiency frontier further than either technique alone.
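The first bullet rests on the defining property of state-space decoders: a fixed-size state updated once per token. A toy diagonal recurrence (S4D-style, not the paper's LFM) shows why the cost is linear in sequence length:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Diagonal linear SSM: x_t = A*x_{t-1} + B*u_t,  y_t = C.x_t.

    One pass over the sequence: O(T*n) time and O(n) memory for
    state size n, versus O(T^2) pairwise work for attention.
    """
    x = np.zeros_like(A)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = A * x + B * u_t  # elementwise: A holds the diagonal of the state matrix
        y[t] = C @ x
    return y

rng = np.random.default_rng(1)
n = 8
A = np.full(n, 0.9)          # stable, decaying dynamics
B = rng.standard_normal(n)
C = rng.standard_normal(n)
u = rng.standard_normal(32)  # a length-32 input sequence
y = ssm_scan(u, A, B, C)
```

Because the state never grows with the sequence, context length is limited by what the recurrence can retain, not by memory, which is the route to larger context windows the bullet points at.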
Load-bearing premise
The LFM decoder and Token-Grid Correlation Module together deliver the claimed accuracy and efficiency gains on fine-grained tasks without hidden costs or extra tuning that would appear in real deployment.
What would settle it
A side-by-side test on a fine-grained benchmark such as GQA or VQA-v2 that measures both task accuracy and wall-clock inference time against a comparable Transformer decoder; if accuracy falls or runtime shows no clear linear advantage, the claim is falsified.
Original abstract
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. It further proposes a Token-Grid Correlation Module to compute lightweight correlations between text tokens and image patches, with modulation performed via a state-space model using FiLM conditioning. The design targets linear-time inference while improving visual grounding for fine-grained reasoning tasks. Experimental results on multiple benchmarks are claimed to show accurate fine-grained understanding with significantly improved efficiency over existing MLLMs.
Significance. If the accuracy-efficiency tradeoff is validated, the approach could meaningfully advance deployable VL models for resource-constrained applications by replacing quadratic cross-attention with linear-complexity alternatives while preserving fine-grained performance. The open release of model and code supports reproducibility and follow-on work.
Minor comments (3)
- [Abstract] The claim of 'significantly improved efficiency' and 'accurate, fine-grained understanding' would be strengthened by including at least one or two key quantitative metrics (e.g., accuracy delta and FLOPs/latency reduction versus a named baseline) rather than leaving the results entirely qualitative.
- [§3.2] (Token-Grid Correlation Module): the linear-time inference claim would benefit from an explicit big-O derivation or a table comparing complexity to standard cross-attention; currently the description is high-level and relies on state-space model properties without a self-contained proof sketch.
- [§4] (Experiments): while benchmarks are mentioned, the manuscript should ensure ablation tables explicitly isolate the LFM decoder's contribution from the Token-Grid module's, and report variance across multiple runs to support the fine-grained reasoning claims.
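On the §3.2 comment: the missing complexity sketch is short. Under standard assumptions (sequence length $T$, width $d$, $P$ image patches, SSM state size $n$, with $P$ and $n$ fixed at design time), a plausible accounting is:

```latex
\begin{align*}
\text{Transformer cross-attention:}\quad
  & QK^\top,\ AV \;\Rightarrow\; O(T^2 d)\ \text{time},\ O(T^2)\ \text{memory},\\
\text{token--grid correlation:}\quad
  & \text{one } T \times P \text{ similarity map} \;\Rightarrow\; O(TPd)\ \text{time},\\
\text{SSM decoder:}\quad
  & x_t = \bar{A}x_{t-1} + \bar{B}u_t,\quad y_t = C x_t
    \;\Rightarrow\; O(Tdn)\ \text{time},\ O(dn)\ \text{state}.
\end{align*}
```

Every term except cross-attention is linear in $T$, which is the substance of the linear-time claim; note this is a reconstruction from the abstract's description, not the paper's own derivation.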
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation for minor revision. The provided summary accurately reflects the core contributions of Firebolt-VL, including the replacement of the Transformer decoder with an LFM-based decoder and the introduction of the Token-Grid Correlation Module for linear-complexity cross-modality modulation.
Circularity Check
No significant circularity detected
Full rationale
The paper introduces Firebolt-VL via architectural choices (LFM decoder replacing Transformer, Token-Grid Correlation Module with state-space model and FiLM) motivated by efficiency and fine-grained grounding needs. No equations, derivations, predictions, or fitted parameters are shown that reduce to inputs by construction. Claims rest on empirical benchmark results rather than self-referential math or self-citation chains. The argument is self-contained as a design proposal evaluated externally.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  "replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder... Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates via the state-space model with FiLM conditioning... O(T GDt + T D²t + T D t f(T))"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  "S4 and S4D yield higher performance... structured state-space models are more effective in capturing the interactions"
Forward citations
Cited by 1 Pith paper
-
Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver
The CARM module boosts neural routing solvers by adaptively modulating embeddings with constraint variables, enabling better use of global observations and improved performance on constrained VRPs.
Reference graph
Works this paper leans on
-
[1]
Liquid Foundation Models: Our First Series of Generative AI Models
Liquid AI. Liquid Foundation Models: Our First Series of Generative AI Models. Liquid AI, 2024.
-
[2]
VQA: Visual Question Answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2425–2433, 2015.
-
[3]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
-
[4]
Qwen Technical Report
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
-
[5]
MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-Task Learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
-
[6]
An Image Is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In Proceedings of the European Conference on Computer Vision, pages 19–35, 2024.
-
[7]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
-
[8]
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024.
-
[9]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
-
[10]
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, et al. MobileVLM V2: Faster and stronger baseline for vision language model. arXiv preprint arXiv:2402.03766, 2024.
-
[11]
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
-
[12]
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Large Model Enhancement
Qianhan Feng, Wenshuo Li, Tong Lin, and Xinghao Chen. Align-KD: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4178–4188, 2025.
-
[13]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models, 2025.
-
[14]
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
-
[15]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, 2024.
-
[16]
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
-
[17]
On the Parameterization and Initialization of Diagonal State Space Models
Albert Gu, Karan Goel, Ankit Gupta, and Christopher Ré. On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983, 2022.
-
[18]
State-Space Models
James D. Hamilton. State-space models. Handbook of Econometrics, 4:3039–3080, 1994.
-
[19]
Liquid Structural State-Space Models
Ramin Hasani, Mathias Lechner, Tsun-Hsuan Wang, Makram Chahine, Alexander Amini, and Daniela Rus. Liquid structural state-space models. In Proceedings of the Eleventh International Conference on Learning Representations, 2023.
-
[20]
A Diagram Is Worth a Dozen Images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Proceedings of the European Conference on Computer Vision, pages 235–251, 2016.
-
[21]
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36:71683–71702, 2023.
-
[22]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
-
[23]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742, 2023.
-
[24]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
-
[25]
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Yatian Pang, Munan Ning, et al. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
-
[26]
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
-
[27]
Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
-
[28]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
-
[29]
MMBench: Is Your Multi-Modal Model an All-Around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In Proceedings of the European Conference on Computer Vision, pages 216–233, 2024.
-
[30]
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
-
[31]
SmolVLM: Redefining Small and Efficient Multimodal Models
Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. SmolVLM: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299, 2025.
-
[32]
Understanding Guided Image Captioning Performance Across Domains
Edwin G. Ng, Bo Pang, Piyush Sharma, and Radu Soricut. Understanding guided image captioning performance across domains. arXiv preprint arXiv:2012.02339, 2020.
-
[33]
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
Jaeyoo Park, Jin Young Choi, Jeonghyung Park, and Bohyung Han. Hierarchical visual feature aggregation for OCR-free document understanding. arXiv preprint arXiv:2411.05254, 2024.
-
[34]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
-
[35]
FiLM: Visual Reasoning with a General Conditioning Layer
Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
-
[36]
Conceptual Captions: A Cleaned, Hypernymed, Image Alt-Text Dataset for Automatic Image Captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
-
[37]
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
-
[38]
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
-
[39]
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
-
[40]
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv preprint arXiv:2411.10442, 2024.
-
[41]
DeepSeek-OCR: Contexts Optical Compression
Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.
-
[42]
MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. MobileVLM: A vision-language model for better intra- and inter-UI understanding. arXiv preprint arXiv:2409.14818, 2024.
-
[43]
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. LLaVA-CoT: Let vision language models reason step-by-step, 2024.
-
[44]
Dense Connector for MLLMs
Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for MLLMs. Advances in Neural Information Processing Systems, 37:33108–33140, 2024.
-
[45]
MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
-
[46]
AlignGPT: Multi-Modal Large Language Models with Adaptive Alignment Capability
Fei Zhao, Taotian Pang, Chunhui Li, Zhen Wu, Junjie Guo, Shangyu Xing, and Xinyu Dai. AlignGPT: Multi-modal large language models with adaptive alignment capability. arXiv preprint arXiv:2405.14129, 2024.
-
[47]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
-
[48]
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model
Yichen Zhu, Minjie Zhu, Ning Liu, Zhiyuan Xu, and Yaxin Peng. LLaVA-Phi: Efficient multi-modal assistant with small language model. In Proceedings of the 1st International Workshop on Efficient Multimedia Computing under Limited, pages 18–22, 2024.