pith. machine review for the scientific record

arxiv: 2312.16886 · v2 · submitted 2023-12-28 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 16:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision language model · mobile devices · multimodal AI · efficient inference · small language models · CLIP pretraining · on-device assistant

The pith

Vision-language models at 1.4B and 2.7B parameters match much larger systems on standard benchmarks while decoding at 21.5 tokens per second on a mobile CPU and 65.3 tokens per second on an edge GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MobileVLM combines small language models trained from scratch with a CLIP-pretrained vision model and an efficient projector to create a multimodal assistant suitable for mobile devices. The models achieve performance comparable to much larger vision-language systems on standard benchmarks. They also deliver high inference speeds, reaching 21.5 tokens per second on a Qualcomm Snapdragon 888 CPU and 65.3 tokens per second on an NVIDIA Jetson Orin GPU. A reader would care because this enables capable AI assistants to operate directly on phones without cloud connectivity or heavy resource demands. The authors plan to release the code, making the approach accessible.

Core claim

MobileVLM is an amalgamation of mobile-oriented designs: 1.4B and 2.7B parameter language models trained from scratch, a vision model pre-trained in the CLIP fashion, and an efficient projector for cross-modality interaction. It demonstrates on-par performance with much larger models on VLM benchmarks and achieves state-of-the-art inference speeds of 21.5 tokens per second on a Snapdragon 888 CPU and 65.3 tokens per second on a Jetson Orin GPU.
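The speed claims invite replication, yet the abstract does not state the measurement protocol (batch size, prompt length, precision). A minimal sketch of one common way to measure decode throughput follows, assuming batch size 1 and a generic generate_fn that returns the full token sequence; the function name and settings are illustrative, not the paper's.

```python
import time

def decode_throughput(generate_fn, prompt_ids, max_new_tokens=128):
    # generate_fn is a stand-in for any model's autoregressive decode;
    # it is assumed to return the full sequence (prompt + new tokens).
    start = time.perf_counter()
    output_ids = generate_fn(prompt_ids, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - len(prompt_ids)
    return new_tokens / elapsed  # tokens per second, batch size 1
```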

What carries the argument

The efficient projector enabling interaction between the small language models and the CLIP-pretrained vision model.
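The page gives no detail on the projector's design, so the following PyTorch sketch is only one plausible reading: it maps CLIP patch features into the language model's embedding width and shrinks the visual token grid with a strided depthwise convolution, since fewer visual tokens mean a cheaper prefill. Every dimension, layer choice, and the 4x token reduction are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class LightweightProjector(nn.Module):
    # Illustrative only: the page says "efficient projector" but gives
    # no design. This sketch maps CLIP patch features to the LM width
    # and cuts the visual token count 4x with a strided depthwise conv.
    def __init__(self, vision_dim=1024, lm_dim=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        self.down = nn.Conv2d(lm_dim, lm_dim, kernel_size=3,
                              stride=2, padding=1, groups=lm_dim)

    def forward(self, patch_feats):           # (B, N, vision_dim)
        x = self.proj(patch_feats)            # (B, N, lm_dim)
        b, n, c = x.shape
        side = int(n ** 0.5)                  # assumes a square patch grid
        x = x.transpose(1, 2).reshape(b, c, side, side)
        x = self.down(x)                      # halve each spatial side
        return x.flatten(2).transpose(1, 2)   # (B, ~N/4, lm_dim)
```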

If this is right

  • Mobile vision-language applications can run locally on devices with limited compute.
  • Smaller models can deliver competitive multimodal performance when using targeted pretraining and projection techniques.
  • Open release of such models supports further development in efficient on-device AI.
  • High token throughput allows for responsive user interactions in real-time mobile scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such designs might reduce the need for large data centers by shifting multimodal processing to edge devices.
  • Similar combinations could be tested with other vision encoders beyond CLIP for potential gains.
  • The speed improvements suggest viability for interactive applications like real-time image description on phones.

Load-bearing premise

The combination of scratch-trained small LMs, CLIP vision pretraining, and an efficient projector produces competitive benchmark scores.

What would settle it

Benchmark results in which MobileVLM scores fall well below those of larger models on the same VLM tests would disprove the on-par performance claim.
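As one concrete reading of "well below", shared benchmark scores could be tabulated and any relative gap beyond a chosen tolerance flagged. The helper below is a toy check under that assumption; the on_par name, the 3% tolerance, and the score dictionaries a caller would supply are all placeholders, not the paper's methodology.

```python
def on_par(small_scores, large_scores, tolerance=0.03):
    # "On par" here means: within a relative tolerance of the larger
    # model on every shared benchmark. The 3% tolerance is an
    # illustrative choice, not a threshold from the paper.
    gaps = {}
    for name, big in large_scores.items():
        small = small_scores.get(name)
        if small is not None:
            gaps[name] = (big - small) / big
    return all(g <= tolerance for g in gaps.values()), gaps
```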

Original abstract

We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jeston Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents MobileVLM, a multimodal vision-language model for mobile devices that combines 1.4B and 2.7B parameter language models trained from scratch, a CLIP-pretrained vision encoder, and an efficient projector for cross-modality interaction. It reports competitive performance on standard VLM benchmarks relative to larger models and claims state-of-the-art inference speeds of 21.5 tokens/s on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/s on an NVIDIA Jetson Orin GPU, with code to be released.

Significance. If the empirical claims hold under scrutiny, the work offers a practical open-source contribution toward deploying capable VLMs on resource-constrained mobile hardware, filling a gap between large-scale models and edge deployment. The emphasis on hardware-specific throughput measurements and the planned code release support reproducibility and further research in efficient multimodal systems.

major comments (2)
  1. Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.
  2. Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.
minor comments (2)
  1. Abstract: Typo in hardware name ('Jeston Orin' should read 'Jetson Orin').
  2. Abstract: The list of 'several typical VLM benchmarks' is not enumerated, reducing clarity on the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help clarify the presentation of our results. We address each major comment below and will revise the manuscript to improve the abstract's precision while maintaining the integrity of our empirical claims.

Point-by-point responses
  1. Referee: Abstract: The state-of-the-art inference speed claims of 21.5 tokens/s on Snapdragon 888 and 65.3 tokens/s on Jetson Orin are presented without side-by-side token/s measurements for competing VLMs (such as LLaVA variants or Phi-2) on identical hardware, batch size, prompt length, and precision settings. This absence directly undermines the SOTA assertion, as the central performance claim rests on unverified superiority rather than direct evidence.

    Authors: We agree that side-by-side measurements on identical hardware, batch size, prompt length, and precision would provide the strongest possible support for the SOTA claim. Many competing models have not reported results under these exact conditions, which limits direct replication. In the revision we will expand the abstract and add a dedicated comparison table that includes all publicly available speed numbers from the literature (with hardware and settings noted) alongside our own measurements, and we will qualify the SOTA statement to reflect the specific devices and conditions used. revision: yes

  2. Referee: Abstract: The claim of 'on par performance compared with a few much larger models' is not supported by specific benchmark scores, exact model sizes, dataset splits, or error bars in the provided text. Without these details or tables showing variance across runs, it is impossible to assess whether the results substantiate the performance parity.

    Authors: The full manuscript contains detailed tables with exact benchmark scores, model sizes (1.4B and 2.7B), dataset splits, and comparisons against larger models. To address the concern, we will revise the abstract to include a concise set of representative scores (e.g., on VQA-v2, GQA, and ScienceQA) together with the corresponding model sizes, while continuing to direct readers to the main-text tables for complete results and any run-to-run variance. revision: yes

Circularity Check

0 steps flagged

No circularity; the architecture description and benchmark results are empirically self-contained.

full rationale

The paper describes MobileVLM as an amalgamation of small LMs (1.4B/2.7B parameters trained from scratch), CLIP-style vision pretraining, and an efficient projector. Performance is reported via direct evaluation on VLM benchmarks and hardware-specific inference measurements (21.5 tokens/s on Snapdragon 888, 65.3 on Jetson Orin). No equations, derivations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the text. Claims rest on external benchmark numbers and measured throughputs rather than any reduction to the paper's own inputs by construction. The SOTA speed assertion is an empirical comparison claim, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work relies on standard deep-learning practices such as CLIP pretraining and projector alignment whose details are not specified here.

pith-pipeline@v0.9.0 · 5499 in / 1094 out tokens · 33560 ms · 2026-05-16T16:29:49.950995+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  3. LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

  4. Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

    cs.CV 2026-05 unverdicted novelty 6.0

    Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and ...

  5. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  6. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  7. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  8. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  9. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  10. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  11. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  12. Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

    cs.CV 2026-04 unverdicted novelty 5.0

    RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.

  13. ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

    cs.CV 2026-04 unverdicted novelty 5.0

    ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.

  14. Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.

  15. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  16. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

  17. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

133 extracted references · 133 canonical work pages · cited by 17 Pith papers · 42 internal anchors
