pith. machine review for the scientific record.

arxiv: 2504.05299 · v1 · submitted 2025-04-07 · 💻 cs.AI · cs.CV

Recognition: 2 theorem links

SmolVLM: Redefining small and efficient multimodal models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:18 UTC · model grok-4.3

classification: 💻 cs.AI · cs.CV
keywords: SmolVLM · vision-language models · efficient VLMs · small multimodal models · tokenization strategies · edge deployment · video understanding

The pith

SmolVLM shows optimized small vision-language models can outperform much larger ones with far less memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SmolVLM as a family of compact multimodal models built for low-resource inference instead of copying the heavy designs of bigger systems. By testing different architectures, tokenization methods, and training data selections, the authors find combinations that deliver strong results on both image and video tasks while keeping memory use low. Their smallest model runs in under 1GB of GPU memory yet beats a model three hundred times its size that was released a year and a half earlier. The largest version in the series matches top-performing models that need twice the memory. These results point to practical multimodal capabilities becoming available on phones and edge devices rather than only on large servers.

Core claim

SmolVLM models are created through systematic choices in architecture, tokenization, and data curation that cut computational costs. The 256M-parameter version uses less than 1GB GPU memory at inference time and outperforms the 80B-parameter Idefics model. The 2.2B-parameter version matches current high-performing vision-language models while using half their memory. The same models also handle video understanding tasks effectively.
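
As a quick sanity check on the memory figure, the sketch below loads a 256M-scale checkpoint in bfloat16 and reports peak GPU memory for a single generation. The checkpoint name and loading path are assumptions based on the public SmolVLM release, not details taken from this abstract; the exact number will vary with image resolution, context length, and allocator behavior.

```python
# Minimal sketch, assuming the public SmolVLM-256M-Instruct checkpoint and
# standard Hugging Face transformers support; treat the printed figure as
# indicative, not as the paper's reported measurement.
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any test image
image = Image.open(requests.get(url, stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe the image briefly."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)

print(processor.batch_decode(out, skip_special_tokens=True)[0])
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```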

What carries the argument

Efficient tokenization strategies paired with targeted architectural changes and curated training data that lower memory demand while preserving task accuracy.
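
The abstract does not spell out the mechanism, so purely as an illustration: one token-reduction tactic common in this family of models is pixel shuffle (space-to-depth) on the vision encoder's patch grid, which trades token count for channel width. A minimal sketch under that assumption, not a reproduction of the paper's actual design:

```python
import torch

def pixel_shuffle_compress(features: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Space-to-depth on a (B, H, W, C) grid of visual patch features.

    Each ratio x ratio block of tokens is merged into one token with
    ratio**2 * C channels, so the language model attends over ratio**2
    fewer visual tokens without discarding information.
    """
    b, h, w, c = features.shape
    assert h % ratio == 0 and w % ratio == 0
    x = features.reshape(b, h // ratio, ratio, w // ratio, ratio, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # (B, H/r, W/r, r, r, C)
    return x.reshape(b, (h // ratio) * (w // ratio), ratio * ratio * c)

# Illustrative 32x32 patch grid with 768-dim features:
# 1024 visual tokens -> 256 tokens of width 3072.
tokens = torch.randn(1, 32, 32, 768)
print(pixel_shuffle_compress(tokens, ratio=2).shape)  # torch.Size([1, 256, 3072])
```

Cutting visual tokens by 4x shrinks both the prompt length and the attention KV cache, which is where most of the inference-memory saving in this style of design would come from.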

If this is right

  • Multimodal image and video tasks become feasible on mobile and edge hardware without large servers.
  • Energy use for running vision-language models drops enough to support always-on applications.
  • Development focus can shift from ever-larger parameter counts to smarter design for smaller scales.
  • Video comprehension features can be added to devices with tight memory budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Efficiency-focused design may prove more important than raw scale for many real-world multimodal uses.
  • The same tokenization and curation tactics could transfer to other compact models in different domains.
  • Testing these models on even tighter constraints like CPU-only or quantized inference would reveal further limits.

Load-bearing premise

The performance comparisons assume that small and large models were tested under identical evaluation rules and similar training data conditions.

What would settle it

Re-running the exact same benchmarks on SmolVLM-256M and Idefics-80B with matched hardware, prompts, and data splits would show whether the small model truly outperforms.
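
Operationally, "matched" means the same splits, prompt templates, and decoding parameters for every model. A minimal sketch of such a harness; the loader and benchmark objects are hypothetical placeholders rather than a real evaluation API, and the checkpoint names are the publicly released ones, assumed here:

```python
# Hypothetical harness: load_vlm() and BENCHMARKS are placeholders, not real APIs.
DECODING = dict(do_sample=False, max_new_tokens=32)  # identical for every model

def evaluate(model_id: str, benchmark) -> float:
    model, processor = load_vlm(model_id)          # placeholder loader
    correct = 0
    for sample in benchmark:                       # same split, same order
        prompt = benchmark.format_prompt(sample)   # same template for both models
        pred = model.generate_text(processor, sample.image, prompt, **DECODING)
        correct += benchmark.score(pred, sample.answer)
    return correct / len(benchmark)

for bench in BENCHMARKS:
    for model_id in ("HuggingFaceTB/SmolVLM-256M-Instruct",
                     "HuggingFaceM4/idefics-80b-instruct"):
        print(bench.name, model_id, round(evaluate(model_id, bench), 3))
```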

read the original abstract

Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications. We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints. Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities. Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SmolVLM, a family of compact vision-language models (256M to 2.2B parameters) engineered for low GPU memory usage via optimized architectures, tokenization, and data curation. Central claims are that SmolVLM-256M uses <1GB memory at inference and outperforms the 300x larger Idefics-80B despite an 18-month gap, while the 2.2B variant rivals SOTA VLMs at half the memory; models also show video comprehension.

Significance. If the performance claims hold under matched conditions, the work would be significant for demonstrating that targeted efficiency optimizations can enable competitive multimodal performance at small scales, directly supporting on-device and edge deployment of VLMs where large models are impractical.

major comments (2)
  1. [Abstract] Abstract: The load-bearing claim that SmolVLM-256M outperforms Idefics-80B requires explicit verification that Idefics results were obtained under identical evaluation protocols, datasets, prompt templates, decoding parameters, and task formulations; without this, the 18-month development gap introduces uncontrolled confounds that prevent interpreting the gap as evidence of superior design.
  2. [Results] Results/Experiments (inferred from abstract): Reported benchmark wins are presented without error bars, ablation tables, or training curves, preventing assessment of statistical robustness and isolating the contribution of the claimed architectural and tokenization choices.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'outperforms the 300-times larger Idefics-80B model' would benefit from a parenthetical note on the exact benchmarks and settings used for both models.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to strengthen the presentation of our results and evaluation protocols.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The load-bearing claim that SmolVLM-256M outperforms Idefics-80B requires explicit verification that Idefics results were obtained under identical evaluation protocols, datasets, prompt templates, decoding parameters, and task formulations; without this, the 18-month development gap introduces uncontrolled confounds that prevent interpreting the gap as evidence of superior design.

    Authors: We appreciate the referee highlighting the need for explicit protocol matching. In the revised manuscript we have added a dedicated evaluation protocol subsection that documents the exact datasets, prompt templates, decoding parameters (temperature, top-p, max new tokens), and task formulations used for all models, including direct alignment with the publicly reported Idefics-80B setup. While we cannot re-execute the 80B model due to resource constraints, the comparisons rely on standardized public benchmarks whose protocols are well-documented in the original papers; we have also added a brief discussion of the temporal gap and why the observed efficiency gains remain attributable to our design choices rather than uncontrolled variables. revision: partial

  2. Referee: [Results] Results/Experiments (inferred from abstract): Reported benchmark wins are presented without error bars, ablation tables, or training curves, preventing assessment of statistical robustness and isolating the contribution of the claimed architectural and tokenization choices.

    Authors: We agree that these elements improve interpretability. The revised results section now includes error bars computed over three independent evaluation runs for the primary benchmarks, a new ablation table isolating the impact of tokenization strategy and architectural modifications, and training curves (loss and validation accuracy) placed in the supplementary material. These additions allow readers to assess statistical robustness and the specific contributions of the optimizations we claim. revision: yes
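
For the promised error bars, the arithmetic is small; a minimal sketch with hypothetical per-run accuracies:

```python
import statistics

# Hypothetical accuracies from three independent evaluation runs of one benchmark.
runs = [61.2, 60.8, 61.5]
mean = statistics.mean(runs)
stderr = statistics.stdev(runs) / len(runs) ** 0.5
# Normal approximation for brevity; with n=3 a t-interval would be noticeably wider.
print(f"{mean:.1f} +/- {1.96 * stderr:.1f} (approx. 95% CI over {len(runs)} runs)")
```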

Circularity Check

0 steps flagged

No significant circularity in empirical performance claims

full rationale

The paper reports experimental results from training and evaluating a family of compact VLMs, with performance numbers presented as direct measurements on image and video tasks. No mathematical derivations, equations, or first-principles predictions appear that reduce by construction to fitted inputs, self-citations, or renamed ansatzes. The headline outperformance claim (SmolVLM-256M vs. Idefics-80B) is an empirical comparison rather than a derived quantity; any concerns about benchmark equivalence fall under validity rather than circularity. The design exploration is described as systematic search over configurations, not tautological self-definition. The derivation chain is therefore self-contained as a report of measured outcomes.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger reflects standard VLM training assumptions plus the specific scale choices stated.

free parameters (2)
  • model parameter count
    256M and 2.2B sizes selected to target memory budgets
  • tokenization compression ratio
    Aggressive token reduction factor chosen for efficiency
axioms (1)
  • domain assumption: Transformer-based vision-language architecture remains effective at small scale
    Invoked by building on existing VLM designs

pith-pipeline@v0.9.0 · 5587 in / 1197 out tokens · 29201 ms · 2026-05-13T20:18:40.176292+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

    cs.CV 2026-05 unverdicted novelty 8.0

    TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

  2. Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

    cs.CV 2026-04 conditional novelty 8.0

    VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

  3. Anny-Fit: All-Age Human Mesh Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    Anny-Fit jointly optimizes all-age multi-person 3D human meshes in camera coordinates using complementary signals from off-the-shelf depth, segmentation, keypoint, and VLM networks, yielding better reprojection, depth...

  4. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  5. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  6. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  7. ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.

  8. SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.

  9. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  10. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  11. Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.

  12. NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.

  13. Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

    cs.CV 2026-04 unverdicted novelty 6.0

    Parameter-efficient fine-tuning lets MLLMs serve as effective retrievers for natural-language-guided cross-view geo-localization, beating dual-encoder baselines on GeoText-1652 and CVG-Text while using far fewer train...

  14. BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    VLMs exhibit a consistent 'Texture Bias Cliff' and fail to comprehend pure geometric shapes from boundary contours alone in zero-shot settings.

  15. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  16. An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

    cs.CV 2026-04 unverdicted novelty 6.0

    A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.

  17. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  18. LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

    cs.CV 2026-05 unverdicted novelty 5.0

    LiteMedCoT-VL distills chain-of-thought from a 235B model to 2B VLMs via LoRA, reaching 64.9% accuracy on PMC-VQA and beating a 4B zero-shot baseline by 11 points.

  19. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  20. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  21. SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    SVD-Prune selects vision tokens via SVD leverage scores to keep performance high even when pruning to only 16-32 tokens.

  22. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  23. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  24. WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    WSVD delivers over 1.8x faster VLM decoding via weighted low-rank approximation at fine granularity plus quantization, without accuracy loss.

  25. Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention

    cs.RO 2026-05 unverdicted novelty 4.0

    Enhanced EWC for LVLMs cuts forgetting rates by 78% versus naive training and keeps visual-textual alignment with 15% extra compute.

  26. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  27. Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification

    cs.CV 2026-04 unverdicted novelty 4.0

    Detection-guided prompting raises small VLM hazard F1 from 34.5% to 50.6% and BERTScore from 0.61 to 0.82 on construction images with only 2.5 ms added latency.
