Recognition: 1 theorem link
· Lean theorem: "MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices"
Pith reviewed 2026-05-16 16:29 UTC · model grok-4.3
The pith
A vision-language model with 1.4B and 2.7B parameter variants matches much larger models on VLM benchmarks while running at up to 65 tokens per second on a mobile-class GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobileVLM combines several mobile-oriented designs: 1.4B and 2.7B parameter language models trained from scratch, a multimodal vision model pre-trained in CLIP fashion, and an efficient projector for cross-modality interaction. It performs on par with much larger models on VLM benchmarks and achieves state-of-the-art inference speeds of 21.5 tokens per second on a Qualcomm Snapdragon 888 CPU and 65.3 tokens per second on an NVIDIA Jetson Orin GPU.
What carries the argument
The efficient projector enabling interaction between the small language models and the CLIP vision model.
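The abstract does not detail the projector's internals, so the following is only an illustrative sketch of what such a component does: map CLIP patch embeddings into the language model's embedding space, and downsample the patch grid so the LM decodes over fewer visual tokens. All dimensions and the 2x2 average pooling are hypothetical choices, not the paper's design.

```python
import numpy as np

def project_vision_tokens(clip_feats, W, b, pool=2):
    """Map CLIP patch features (n_patches, d_vision) into the LM embedding
    space (d_lm), then average-pool the patch grid in pool x pool blocks so
    the language model sees fewer visual tokens. Illustrative sketch only;
    MobileVLM's actual projector is not specified in the abstract."""
    n, _ = clip_feats.shape
    side = int(np.sqrt(n))                 # assume a square patch grid
    projected = clip_feats @ W + b         # (n, d_lm) linear projection
    grid = projected.reshape(side, side, -1)
    # pool x pool average pooling -> pool^2 fewer tokens handed to the LM
    pooled = grid.reshape(side // pool, pool, side // pool, pool, -1).mean(axis=(1, 3))
    return pooled.reshape(-1, projected.shape[1])

rng = np.random.default_rng(0)
feats = rng.normal(size=(576, 1024))       # e.g. a 24x24 ViT patch grid
W = rng.normal(size=(1024, 2048)) * 0.02   # hypothetical d_lm = 2048
tokens = project_vision_tokens(feats, W, np.zeros(2048))
print(tokens.shape)                        # (144, 2048): 4x fewer visual tokens
```

Cutting the visual token count this way directly reduces per-step decode cost, which is one plausible lever behind the reported mobile throughput.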
If this is right
- Mobile vision-language applications can run locally on devices with limited compute.
- Smaller models can deliver competitive multimodal performance when using targeted pretraining and projection techniques.
- Open release of such models supports further development in efficient on-device AI.
- High token throughput allows for responsive user interactions in real-time mobile scenarios.
Where Pith is reading between the lines
- Such designs might reduce the need for large data centers by shifting multimodal processing to edge devices.
- Similar combinations could be tested with other vision encoders beyond CLIP for potential gains.
- The speed improvements suggest viability for interactive applications like real-time image description on phones.
Load-bearing premise
The combination of scratch-trained small LMs, CLIP vision pretraining, and efficient projector produces competitive benchmark scores.
What would settle it
Benchmark results in which MobileVLM scores fall well below those of larger models on the same VLM tests would falsify the on-par performance claim.
read the original abstract
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jeston Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MobileVLM, a multimodal vision-language model for mobile devices that combines 1.4B and 2.7B parameter language models trained from scratch, a CLIP-pretrained vision encoder, and an efficient projector for cross-modality interaction. It reports competitive performance on standard VLM benchmarks relative to larger models and claims state-of-the-art inference speeds of 21.5 tokens/s on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/s on an NVIDIA Jetson Orin GPU, with code to be released.
Significance. If the empirical claims hold under scrutiny, the work offers a practical open-source contribution toward deploying capable VLMs on resource-constrained mobile hardware, filling a gap between large-scale models and edge deployment. The emphasis on hardware-specific throughput measurements and the planned code release support reproducibility and further research in efficient multimodal systems.
major comments (2)
- Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.
- Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.
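The side-by-side measurement the first comment asks for can be sketched as a simple timing harness. `generate_one_token` below is a hypothetical stand-in for one decode step of any model under test; the point is that hardware, batch size, prompt length, and precision must be held identical across models before tokens/s numbers support a SOTA claim.

```python
import time

def measure_decode_throughput(generate_one_token, n_tokens=128, warmup=8):
    """Time autoregressive decoding and report tokens per second.
    `generate_one_token` is a placeholder for one decode step of a model;
    warmup iterations are discarded so caches, JIT, and clock scaling do
    not skew the measurement."""
    for _ in range(warmup):
        generate_one_token()
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_one_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Hypothetical stand-in: a decode step of roughly 15 ms corresponds to the
# ~65 tokens/s range the paper reports on Jetson Orin.
tps = measure_decode_throughput(lambda: time.sleep(0.015), n_tokens=20, warmup=2)
print(f"{tps:.1f} tokens/s")
```

Running the same harness with each competing model's decode step, on the same device and settings, would yield the directly comparable numbers the referee requests.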
minor comments (2)
- Abstract: Typo in hardware name ('Jeston Orin' should read 'Jetson Orin').
- Abstract: The list of 'several typical VLM benchmarks' is not enumerated, reducing clarity on the evaluation scope.
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments, which help clarify the presentation of our results. We address each major comment below and will revise the manuscript to improve the abstract's precision while maintaining the integrity of our empirical claims.
read point-by-point responses
- Referee: Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.
Authors: We agree that side-by-side measurements on identical hardware, batch size, prompt length, and precision would provide the strongest possible support for the SOTA claim. Many competing models have not reported results under these exact conditions, which limits direct replication. In the revision we will expand the abstract and add a dedicated comparison table that includes all publicly available speed numbers from the literature (with hardware and settings noted) alongside our own measurements, and we will qualify the SOTA statement to reflect the specific devices and conditions used. revision: yes
- Referee: Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.
Authors: The full manuscript contains detailed tables with exact benchmark scores, model sizes (1.4B and 2.7B), dataset splits, and comparisons against larger models. To address the concern, we will revise the abstract to include a concise set of representative scores (e.g., on VQA-v2, GQA, and ScienceQA) together with the corresponding model sizes, while continuing to direct readers to the main-text tables for complete results and any run-to-run variance. revision: yes
Circularity Check
No circularity; empirical architecture and benchmark results are self-contained
full rationale
The paper describes MobileVLM as an amalgamation of small LMs (1.4B/2.7B parameters trained from scratch), CLIP-style vision pretraining, and an efficient projector. Performance is reported via direct evaluation on VLM benchmarks and hardware-specific inference measurements (21.5 tokens/s on Snapdragon 888, 65.3 on Jetson Orin). No equations, derivations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the text. Claims rest on external benchmark numbers and measured throughputs rather than any reduction to the paper's own inputs by construction. The SOTA speed assertion is an empirical comparison claim, not a circular derivation.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 17 Pith papers
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
  Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
  Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
- LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
  A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
- Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
  Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...
- Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
  A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
- Vision-aligned Latent Reasoning for Multi-modal Large Language Model
  VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
  DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
- Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
  Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
- Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis
  RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.
- ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
  ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.
- Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs
  Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
  Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
Reference graph
Works this paper leans on
- [1] An in-depth look at gemini's language abilities
  Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, and Graham Neubig. arXiv preprint arXiv:2312.11444, 2023.
- [2] Flamingo: a visual language model for few-shot learning
  Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [3] OpenFlamingo
  Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Mar. 2023.
- [4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfe...
- [5] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. arXiv preprint arXiv:2308.12966, 2023.
- [6] Vlmo: Unified vision-language pre-training with mixture-of-modality-experts
  Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
- [7] Pythia: A suite for analyzing large language models across training and scaling
  Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
- [8] Piqa: Reasoning about physical commonsense in natural language
  Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- [9] GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow
  Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. Mar. 2021.
- [10] A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset
  Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, et al. arXiv preprint arXiv:1806.00358, 2018.
- [11] Coyo-700m: Image-text pair dataset
  Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. https://github.com/kakaobrain/coyo-dataset, 2022.
- [12] Once for all: Train one network and specialize it for efficient deployment
  Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. In International Conference on Learning Representations, 2020.
- [13] Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
  Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. In CVPR, 2021.
- [14] MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
  Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. arXiv preprint arXiv:2310.09478, 2023.
- [15] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
  Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. arXiv preprint arXiv:2306.15195, 2023.
- [16] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
  Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. arXiv preprint arXiv:2311.12793, 2023.
- [17] Extending Context Window of Large Language Models via Positional Interpolation
  Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. arXiv preprint arXiv:2306.15595, 2023.
- [18] PaLI-X: On scaling up a multilingual vision and language model
  Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...
- [19] PaLI: A Jointly-Scaled Multilingual Language-Image Model
  Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. arXiv preprint arXiv:2209.06794, 2022.
- [20] Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
  Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. March 2023.
- [21] Make repvgg greater again: A quantization-aware approach
  Xiangxiang Chu, Liang Li, and Bo Zhang. In AAAI.
- [22] Twins: Revisiting the design of spatial attention in vision transformers
  Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. In Adv. Neural Inform. Process. Syst., 2021.
- [23] Conditional positional encodings for vision transformers
  Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, and Chunhua Shen. In The Eleventh International Conference on Learning Representations, 2023.
- [24] Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search
  Xiangxiang Chu, Bo Zhang, and Ruijun Xu. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12239–12248.
- [25] Fair darts: Eliminating unfair advantages in differentiable architecture search
  Xiangxiang Chu, Tianbao Zhou, Bo Zhang, and Jixiang Li. In European Conference on Computer Vision, pages 465–480. Springer, 2020.
- [26] Scaling Instruction-Finetuned Language Models
  Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. arXiv preprint arXiv:2210.11416, 2022.
- [27] BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
  Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. arXiv preprint arXiv:1905.10044, 2019.
- [28] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
  Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. arXiv preprint arXiv:1803.05457, 2018.
- [29] Redpajama: An open source recipe to reproduce llama training dataset
  Together Computer. 2023.
- [30] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
  Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. arXiv preprint arXiv:2305.06500, 2023.
- [31] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Tri Dao. arXiv preprint arXiv:2307.08691, 2023.
- [32] Embodied question answering
  Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10, 2018.
- [33] Imagenet: A large-scale hierarchical image database
  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [34] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. arXiv preprint arXiv:2010.11929, 2020.
- [35] Glm: General language model pretraining with autoregressive blank infilling
  Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. arXiv preprint arXiv:2103.10360, 2021.
- [36] A survey of embodied ai: From simulators to research tasks
  Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022.
- [37] Eva: Exploring the limits of masked visual representation learning at scale
  Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
- [38] Sparsegpt: Massive language models can be accurately pruned in one-shot
  Elias Frantar and Dan Alistarh. 2023.
- [39] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. arXiv preprint arXiv:2210.17323, 2022.
- [40] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. arXiv preprint arXiv:2306.13394, 2023.
- [41] A challenger to gpt-4v? early explorations of gemini in visual expertise
  Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Zhang Mengdan, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, and Xing Sun. arXiv preprint arXiv:2312.12436, 2023.
- [42] A framework for few-shot language model evaluation
  Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. Sept. 2021.
- [43] Openllama: An open reproduction of llama
  Xinyang Geng and Hao Liu. May 2023.
- [44]
- [45] Gemini: A family of highly capable multimodal models
  Google. 2023.
- [46] Textbooks are all you need
  Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. arXiv preprint arXiv:2306.11644, 2023.
- [47] Masked autoencoders are scalable vision learners
  Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- [48] Measuring Massive Multitask Language Understanding
  Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. arXiv preprint arXiv:2009.03300, 2020.
- [49] Training Compute-Optimal Large Language Models
  Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. arXiv preprint arXiv:2203.15556, 2022.
- [50] Searching for mobilenetv3
  Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1314–1324, 2019.
- [51] LoRA: Low-Rank Adaptation of Large Language Models
  Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. arXiv preprint arXiv:2106.09685, 2021.
- [52] Gqa: A new dataset for real-world visual reasoning and compositional question answering
  Drew A Hudson and Christopher D Manning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- [53] ShareGPT_Vicuna_unfiltered
  Huggingface. https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered.
- [54] Openclip
  Gabriel Ilharco, Mitchell Wortsman, Ross Rollman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. July 2021.
- [55]
- [56] Batch normalization: Accelerating deep network training by reducing internal covariate shift
  Sergey Ioffe and Christian Szegedy. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
- [57] Perceiver: General perception with iterative attention
  Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. In International Conference on Machine Learning, pages 4651–4664. PMLR, 2021.
- [58] Scaling up visual and vision-language representation learning with noisy text supervision
  Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. In International Conference on Machine Learning, pages 4904–
-
[59]
All tokens matter: Token labeling for training better vision transform- ers
Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transform- ers. Advances in neural information processing systems , 34:18590–18602, 2021. 2, 3
work page 2021
-
[60]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[61] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. pages 787–798, 2014. 3
[62] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123:32–73, 2017. 3
[63] Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018. 2, 4
[64] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023. 7
[65] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. The BigScience corpus: A 1.6 TB composite multilingual dataset. 2022. 2
[66] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 3, 4, 7
[67] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 3
[68] Liang Li, Qingyuan Li, Bo Zhang, and Xiangxiang Chu. Norm tweaking: High-performance low-bit quantization of large language models. In AAAI, 2023. 3
[69] Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, Asli Celikyilmaz, Jianfeng Gao, Noah Smith, and Yejin Choi. Robust navigation with language pretraining and stochastic sampling. arXiv preprint arXiv:1909.02244, 2019. 3
[70] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023. 2
[71] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 3, 7
[72] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021. 7
[73] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Eur. Conf. Comput. Vis., pages 740–755. Springer, 2014. 3
[75] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. 10, 11
-
[76]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023. 1, 3, 4, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[77]
DARTS: Differentiable architecture search
Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In International Confer- ence on Learning Representations, 2019. 4
work page 2019
-
[78]
Llava-plus: Learning to use tools for creating multi- modal agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, et al. Llava-plus: Learning to use tools for creating multi- modal agents. arXiv preprint arXiv:2311.05437, 2023. 1, 10
-
[79]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[80]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi- modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[81]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2, 10
work page 2021