pith. machine review for the scientific record

arxiv: 2312.16886 · v2 · submitted 2023-12-28 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 16:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision language model · mobile devices · multimodal AI · efficient inference · small language models · CLIP pretraining · on-device assistant

The pith

Vision-language models at 1.4B and 2.7B parameters match much larger systems on standard benchmarks while decoding at 21.5 tokens per second on a mobile CPU and 65.3 tokens per second on an edge GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MobileVLM combines small language models trained from scratch with a CLIP-pretrained vision model and an efficient projector to create a multimodal assistant suitable for mobile devices. The models achieve performance comparable to much larger vision-language systems on standard benchmarks. They also deliver high inference speeds, reaching 21.5 tokens per second on a Qualcomm Snapdragon 888 CPU and 65.3 tokens per second on an NVIDIA Jetson Orin GPU. A reader would care because this enables capable AI assistants to operate directly on phones without cloud connectivity or heavy resource demands. The authors plan to release the code, making the approach accessible.

Core claim

MobileVLM is an amalgamation of mobile-oriented designs: 1.4B and 2.7B parameter language models trained from scratch, a vision model pre-trained in the CLIP fashion, and an efficient projector for cross-modality interaction. It demonstrates on-par performance with much larger models on VLM benchmarks and achieves state-of-the-art inference speeds of 21.5 tokens per second on a Snapdragon 888 CPU and 65.3 tokens per second on a Jetson Orin GPU.
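The speed claims invite replication, yet the abstract does not state the measurement protocol (batch size, prompt length, precision). A minimal sketch of one common way to measure decode throughput follows, assuming batch size 1 and a generic generate_fn that returns the full token sequence; the function name and settings are illustrative, not the paper's.

```python
import time

def decode_throughput(generate_fn, prompt_ids, max_new_tokens=128):
    # generate_fn is a stand-in for any model's autoregressive decode;
    # it is assumed to return the full sequence (prompt + new tokens).
    start = time.perf_counter()
    output_ids = generate_fn(prompt_ids, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens / elapsed  # tokens per second, batch size 1
```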

What carries the argument

The efficient projector enabling interaction between the small language models and the CLIP-pretrained vision model.
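The page gives no detail on the projector's design, so the following PyTorch sketch is only one plausible reading: it maps CLIP patch features into the language model's embedding width and shrinks the visual token grid with a strided depthwise convolution, since fewer visual tokens mean a cheaper prefill. Every dimension, layer choice, and the 4x token reduction are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class LightweightProjector(nn.Module):
    # Illustrative only: the page says "efficient projector" but gives
    # no design. This sketch maps CLIP patch features to the LM width
    # and cuts the visual token count 4x with a strided depthwise conv.
    def __init__(self, vision_dim=1024, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.down = nn.Conv2d(lm_dim, lm_dim, kernel_size=3,
                              stride=2, padding=1, groups=lm_dim)

    def forward(self, patch_feats):           # (B, N, vision_dim)
        x = self.proj(patch_feats)            # (B, N, lm_dim)
        b, n, c = x.shape
        side = int(n ** 0.5)                  # assumes a square patch grid
        x = x.transpose(1, 2).reshape(b, c, side, side)
        x = self.down(x)                      # halve each spatial side
        return x.flatten(2).transpose(1, 2)   # (B, ~N/4, lm_dim)
```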

If this is right

  • Mobile vision-language applications can run locally on devices with limited compute.
  • Smaller models can deliver competitive multimodal performance when using targeted pretraining and projection techniques.
  • Open release of such models supports further development in efficient on-device AI.
  • High token throughput allows for responsive user interactions in real-time mobile scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such designs might reduce the need for large data centers by shifting multimodal processing to edge devices.
  • Similar combinations could be tested with other vision encoders beyond CLIP for potential gains.
  • The speed improvements suggest viability for interactive applications like real-time image description on phones.

Load-bearing premise

The combination of scratch-trained small LMs, CLIP vision pretraining, and an efficient projector produces competitive benchmark scores.

What would settle it

Benchmark results in which MobileVLM scores fall well below those of larger models on the same VLM tests would disprove the on-par performance claim.
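As one concrete reading of "well below", shared benchmark scores could be tabulated and any relative gap beyond a chosen tolerance flagged. The helper below is a toy check under that assumption; the on_par name, the 3% tolerance, and the score dictionaries a caller would supply are all placeholders, not the paper's methodology.

```python
def on_par(small_scores, large_scores, tolerance=0.03):
    # "On par" here means: within a relative tolerance of the larger
    # model on every shared benchmark. The 3% tolerance is an
    # illustrative choice, not a threshold from the paper.
    gaps = {}
    for name, big in large_scores.items():
        small = small_scores.get(name)
        if small is not None:
            gaps[name] = (big - small) / big
    return all(g <= tolerance for g in gaps.values()), gaps
```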

Original abstract

We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jeston Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MobileVLM, a multimodal vision-language model for mobile devices that combines 1.4B and 2.7B parameter language models trained from scratch, a CLIP-pretrained vision encoder, and an efficient projector for cross-modality interaction. It reports competitive performance on standard VLM benchmarks relative to larger models and claims state-of-the-art inference speeds of 21.5 tokens/s on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/s on an NVIDIA Jetson Orin GPU, with code to be released.

Significance. If the empirical claims hold under scrutiny, the work offers a practical open-source contribution toward deploying capable VLMs on resource-constrained mobile hardware, filling a gap between large-scale models and edge deployment. The emphasis on hardware-specific throughput measurements and the planned code release support reproducibility and further research in efficient multimodal systems.

major comments (2)
  1. Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.
  2. Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.
minor comments (2)
  1. Abstract: Typo in hardware name ('Jeston Orin' should read 'Jetson Orin').
  2. Abstract: The list of 'several typical VLM benchmarks' is not enumerated, reducing clarity on the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the presentation of our results. We address each major comment below and will revise the manuscript to improve the abstract's precision while maintaining the integrity of our empirical claims.

Point-by-point responses
  1. Referee: Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.

    Authors: We agree that side-by-side measurements on identical hardware, batch size, prompt length, and precision would provide the strongest possible support for the SOTA claim. Many competing models have not reported results under these exact conditions, which limits direct replication. In the revision we will expand the abstract and add a dedicated comparison table that includes all publicly available speed numbers from the literature (with hardware and settings noted) alongside our own measurements, and we will qualify the SOTA statement to reflect the specific devices and conditions used. revision: yes

  2. Referee: Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.

    Authors: The full manuscript contains detailed tables with exact benchmark scores, model sizes (1.4B and 2.7B), dataset splits, and comparisons against larger models. To address the concern, we will revise the abstract to include a concise set of representative scores (e.g., on VQA-v2, GQA, and ScienceQA) together with the corresponding model sizes, while continuing to direct readers to the main-text tables for complete results and any run-to-run variance. revision: yes

Circularity Check

0 steps flagged

No circularity; the architecture description and benchmark results are empirically self-contained.

full rationale

The paper describes MobileVLM as an amalgamation of small LMs (1.4B/2.7B parameters trained from scratch), CLIP-style vision pretraining, and an efficient projector. Performance is reported via direct evaluation on VLM benchmarks and hardware-specific inference measurements (21.5 tokens/s on Snapdragon 888, 65.3 on Jetson Orin). No equations, derivations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the text. Claims rest on external benchmark numbers and measured throughputs rather than any reduction to the paper's own inputs by construction. The SOTA speed assertion is an empirical comparison claim, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard deep-learning practices such as CLIP pretraining and projector alignment whose details are not specified here.

pith-pipeline@v0.9.0 · 5499 in / 1094 out tokens · 33560 ms · 2026-05-16T16:29:49.950995+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  3. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

  4. Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

    cs.CV 2026-05 unverdicted novelty 6.0

    Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...

  5. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  6. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  7. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  8. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  9. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  10. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  11. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  12. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  13. ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

    cs.CV 2026-04 unverdicted novelty 5.0

    ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.

  14. Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.

  15. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  16. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  17. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

133 extracted references · 133 canonical work pages · cited by 17 Pith papers · 42 internal anchors
