NVILA: Efficient Frontier Visual Language Models

An-Chieh Cheng, Baifeng Shi, Cheng-Yu Hsieh, Dacheng Li, Daguang Xu, De-An Huang, Haocheng Xi, Hongxu Yin, Jan Kautz, Jinyi Hu, Ligeng Zhu, Pavlo Molchanov, Ranjay Krishna, Shang Yang, Shiyi Cao, Sifei Liu, Song Han, Vishwesh Nath, Xiaolong Wang, Xiuyu Li, Yao Lu, Yukang Chen, Yuming Lou, Yunhao Fang, Yuxian Gu, Zhijian Liu, Zhuoyang Zhang

classification cs.CV

keywords nvila, accuracy, efficiency, models, visual, vlms, language, latency

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to jointly optimize efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We further conduct a systematic investigation that enhances NVILA's efficiency throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training cost by 1.9-5.1x, prefilling latency by 1.6-2.2x, and decoding latency by 1.2-2.8x. We release our code and models to facilitate reproducibility.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.