pith. machine review for the scientific record.

arxiv: 2509.23661 · v3 · submitted 2025-09-28 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 10:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal models · vision-language models · open source training · dataset curation · efficient training · reinforcement learning post-training · LLaVA · benchmark evaluation

The pith

LLaVA-OneVision-1.5 builds competitive multimodal models from scratch with an open end-to-end framework, training on 85M curated pretraining examples and 22M instruction examples within a $16,000 budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LLaVA-OneVision-1.5 as a family of large multimodal models constructed entirely from scratch with an open, efficient, and reproducible training process. It supplies an 85-million-example concept-balanced pretraining dataset and a 22-million-example instruction dataset, then applies an offline parallel data packing strategy to train models within a $16,000 budget. A final lightweight reinforcement learning stage elicits chain-of-thought reasoning. The resulting 8-billion-parameter model outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, while the 4-billion-parameter model surpasses Qwen2.5-VL-3B on all 27. A sympathetic reader would care because the work shows that high-performing vision-language models can be developed and shared without relying on closed proprietary data or massive compute resources.

Core claim

LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks through an open, efficient, end-to-end training framework that combines large-scale curated datasets (85M concept-balanced pretraining examples and 22M instruction examples), offline parallel data packing to stay within a $16,000 budget, and RL-based post-training that unlocks robust chain-of-thought reasoning; the 8B model outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks and the 4B model surpasses Qwen2.5-VL-3B on all 27.

What carries the argument

The complete open end-to-end training framework that integrates concept-balanced pretraining data, instruction data, offline parallel data packing for cost efficiency, and a lightweight RL post-training stage to improve multimodal reasoning.
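
The packing step is the piece most readers will want to see concretely. The paper's implementation is not shown on this page, so the following is only a rough sketch, under the assumption that packing means greedily concatenating variable-length tokenized samples into fixed-length training sequences offline, before any GPU is involved; every name here is illustrative.

    # Hypothetical sketch of offline sample packing: concatenate variable-length
    # tokenized examples into fixed-length sequences ahead of training so little
    # compute is spent on padding. Not the authors' implementation.
    from typing import List

    def pack_samples(samples: List[List[int]], max_len: int = 8192) -> List[List[int]]:
        """Greedy first-fit packing of tokenized samples into sequences of at most max_len tokens."""
        bins: List[List[int]] = []   # packed output sequences
        free: List[int] = []         # remaining capacity per bin
        for sample in sorted(samples, key=len, reverse=True):
            sample = sample[:max_len]            # truncate oversized samples for simplicity
            for i, cap in enumerate(free):
                if len(sample) <= cap:           # first bin with enough room
                    bins[i].extend(sample)
                    free[i] -= len(sample)
                    break
            else:                                # no bin fits: open a new one
                bins.append(list(sample))
                free.append(max_len - len(sample))
        return bins

    # Toy example: three "documents" packed into 16-token sequences -> lengths [15, 7]
    print([len(b) for b in pack_samples([[1] * 10, [2] * 5, [3] * 7], max_len=16)])

A real packer would also record per-sample boundaries so attention and loss masks can keep packed samples from attending to one another; "parallel" in the paper's phrasing presumably refers to running this preprocessing across many workers, which the sketch omits.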

If this is right

  • High-quality curated datasets can deliver strong multimodal performance even when total training spend is limited to $16,000.
  • A lightweight RL post-training stage can elicit better chain-of-thought reasoning on complex multimodal tasks without large additional compute.
  • 4B-scale models trained with this framework can exceed the benchmark results of strong baselines built on proprietary data pipelines.
  • Fully open data and code release lowers the barrier for reproducible multimodal research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If other groups replicate the data curation steps, similar performance levels may become accessible to teams with modest budgets.
  • The results point to data quality and balancing as potentially more decisive than raw data volume in multimodal pretraining.
  • The framework could be tested on additional vision-language tasks or extended to new modalities to check whether the efficiency gains generalize.

Load-bearing premise

The 85M concept-balanced pretraining dataset and 22M instruction dataset are of sufficiently higher quality than prior data sources to produce the reported gains, and the benchmark comparisons are free of selection effects or evaluation differences.
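
The page never says how concept balancing was done. As a purely illustrative stand-in for whatever curation the authors used, one standard recipe is to resample examples with weight inversely proportional to the frequency of their assigned concept, so head concepts stop dominating the mixture:

    # Illustrative concept-balanced resampling, not the paper's curation pipeline:
    # weight each example by the inverse frequency of its concept tag and sample
    # with replacement so rare concepts are upweighted.
    import random
    from collections import Counter

    def concept_balanced_sample(examples, concepts, k, rng=random.Random(0)):
        """examples[i] carries concept tag concepts[i]; draw k examples with
        probability proportional to 1 / freq(concept)."""
        freq = Counter(concepts)
        weights = [1.0 / freq[c] for c in concepts]
        return rng.choices(examples, weights=weights, k=k)

    data = ["cat_1", "cat_2", "cat_3", "dog_1", "axolotl_1"]
    tags = ["cat", "cat", "cat", "dog", "axolotl"]
    print(concept_balanced_sample(data, tags, k=4))

Whether the released datasets were balanced this way, by deduplication, or by targeted collection is exactly what the premise above leaves unverified.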

What would settle it

An independent reproduction that trains the same model sizes on the released datasets and framework but fails to match the claimed outperformance margins over Qwen2.5-VL-7B and Qwen2.5-VL-3B on the 27 benchmarks.
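
Much of what such a reproduction would need to hold fixed is mundane: the same benchmark subsets, prompts, and decoding settings for the candidate and the baseline alike. A minimal sketch of that discipline, with hypothetical names rather than anything from the released evaluation code:

    # Hypothetical harness skeleton: every model is scored through one function
    # with one frozen decoding configuration, so benchmark deltas cannot come
    # from prompt, subset, or sampling differences. All names are illustrative.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass(frozen=True)
    class DecodeConfig:
        temperature: float = 0.0      # greedy decoding
        top_p: float = 1.0
        max_new_tokens: int = 1024

    SHARED = DecodeConfig()           # single config reused for model and baseline

    def run_benchmark(generate: Callable[[str, DecodeConfig], str],
                      items: List[Tuple[str, str]],
                      cfg: DecodeConfig = SHARED) -> float:
        """items are (prompt, gold_answer) pairs; exact-match accuracy."""
        hits = sum(generate(prompt, cfg).strip() == gold for prompt, gold in items)
        return hits / len(items)

    # Toy check with two stand-in "models" on the same two items.
    items = [("2+2=", "4"), ("capital of France?", "Paris")]
    model_a = lambda prompt, cfg: {"2+2=": "4", "capital of France?": "Paris"}[prompt]
    model_b = lambda prompt, cfg: "4"
    print(run_benchmark(model_a, items), run_benchmark(model_b, items))   # 1.0 0.5

Per-benchmark deltas reported under one such frozen configuration, rather than a bare win count, is what would let an outside group confirm or refute the 18/27 and 27/27 margins.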

read the original abstract

We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. (4) RL-based Post-training: We unlock the model's latent potential through a lightweight RL stage, effectively eliciting robust chain-of-thought reasoning to significantly boost performance on complex multimodal reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LLaVA-OneVision-1.5, a family of open large multimodal models (LMMs) trained entirely from scratch. It describes construction of an 85M concept-balanced pretraining dataset (LLaVA-OneVision-1.5-Mid-Traning) and a 22M instruction dataset (LLaVA-OneVision-1.5-Instruct), an efficient end-to-end training framework using offline parallel data packing that completes within a $16,000 budget, state-of-the-art results where the 8B variant outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks and the 4B variant surpasses Qwen2.5-VL-3B on all 27, and a lightweight RL post-training stage to elicit chain-of-thought reasoning on complex multimodal tasks.

Significance. If the performance claims hold under controlled evaluation, the work would provide a fully open, low-cost, and reproducible pipeline for training competitive vision-language models. This could meaningfully advance democratization of multimodal research by releasing curated datasets, training code, and an RL stage that improves reasoning, while demonstrating that high performance is achievable without massive compute.

major comments (3)
  1. [Abstract and Experimental Results] The central performance claims (8B model beats Qwen2.5-VL-7B on 18/27 benchmarks; 4B beats Qwen2.5-VL-3B on all 27) are presented without reported error bars, details on benchmark subset selection, prompt templates, decoding parameters, or confirmation that comparisons were run under identical conditions. This leaves open the possibility that observed deltas arise from evaluation differences rather than the claimed framework or data.
  2. [Dataset Curation and Training Framework] Performance gains are attributed to the 85M concept-balanced pretraining set and 22M instruction set, yet no ablation studies are described that hold architecture, training recipe, and compute fixed while swapping in prior open datasets (e.g., LLaVA-1.5 or ShareGPT4V mixtures). Without such controlled comparisons, the claim that these specific curated corpora are materially higher-quality and responsible for the results cannot be verified.
  3. [RL-based Post-training] The manuscript states that a lightweight RL stage significantly boosts performance on complex reasoning tasks, but provides no quantitative before/after results on the 27 benchmarks, no details on the reward model or RL algorithm, and no comparison to standard supervised fine-tuning baselines. This makes it impossible to assess the incremental contribution of the RL component.
minor comments (2)
  1. [Abstract] 'LLaVA-OneVision-1.5-Mid-Traning' appears to be a typographical error for 'Mid-Training'.
  2. [Experimental Results] The manuscript would benefit from an explicit table listing all 27 benchmarks, the exact scores for LLaVA-OneVision-1.5 variants and the Qwen2.5-VL baselines, and any data-exclusion rules applied during evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications based on our open framework and outlining revisions to improve the manuscript's rigor and reproducibility.

read point-by-point responses
  1. Referee: [Abstract and Experimental Results] The central performance claims (8B model beats Qwen2.5-VL-7B on 18/27 benchmarks; 4B beats Qwen2.5-VL-3B on all 27) are presented without reported error bars, details on benchmark subset selection, prompt templates, decoding parameters, or confirmation that comparisons were run under identical conditions. This leaves open the possibility that observed deltas arise from evaluation differences rather than the claimed framework or data.

    Authors: We thank the referee for highlighting this. All evaluations were conducted under identical conditions using our publicly released evaluation code and the same harness for baselines. We will revise the Experimental Results section and add a dedicated appendix detailing benchmark subsets, exact prompt templates, decoding parameters (e.g., temperature=0, top_p=1.0, greedy decoding), and confirmation of controlled settings. Error bars are not reported as single-run results are standard for large-scale training; we will add a note on this limitation and include inference-seed variance for representative benchmarks in the revision. revision: yes

  2. Referee: [Dataset Curation and Training Framework] Performance gains are attributed to the 85M concept-balanced pretraining set and 22M instruction set, yet no ablation studies are described that hold architecture, training recipe, and compute fixed while swapping in prior open datasets (e.g., LLaVA-1.5 or ShareGPT4V mixtures). Without such controlled comparisons, the claim that these specific curated corpora are materially higher-quality and responsible for the results cannot be verified.

    Authors: We agree that explicit ablations would strengthen attribution. Full-scale ablations holding architecture, recipe, and compute fixed are not feasible within our $16,000 budget and timeline. We will revise the Dataset Curation section to elaborate on the concept-balancing procedure and provide qualitative/quantitative comparisons to prior mixtures. The datasets are fully open-sourced, enabling the community to run such controlled ablations independently. A limited small-scale ablation on data subsets will be included if space allows. revision: partial

  3. Referee: [RL-based Post-training] The manuscript states that a lightweight RL stage significantly boosts performance on complex reasoning tasks, but provides no quantitative before/after results on the 27 benchmarks, no details on the reward model or RL algorithm, and no comparison to standard supervised fine-tuning baselines. This makes it impossible to assess the incremental contribution of the RL component.

    Authors: We acknowledge the need for more quantitative evidence. In the revision, we will add a table with before/after performance on all 27 benchmarks, details on the reward model (trained via preference data), the RL algorithm employed, and direct comparisons against an SFT-only baseline. This will quantify the incremental gains from the lightweight RL stage while keeping the overall compute low. revision: yes
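
Neither the abstract nor the responses above name the RL algorithm. If the lightweight stage follows the group-relative policy optimization style of DeepSeekMath (reference [6] in the graph below), which is a guess rather than anything the paper states, its core is small enough to sketch:

    # Assumed GRPO-style advantage computation for a lightweight RL stage:
    # sample several responses per prompt, score each with a reward model or
    # verifiable checker, and normalize rewards within the group. This is an
    # assumption about the method, not taken from the paper.
    from statistics import mean, pstdev
    from typing import List

    def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
        """Advantage of each sampled response = (reward - group mean) / group std."""
        mu = mean(rewards)
        sigma = pstdev(rewards) or eps      # guard against all-equal groups
        return [(r - mu) / sigma for r in rewards]

    # Four chain-of-thought samples for one prompt, scored 0/1 by a checker.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]

These advantages would then weight a clipped policy-gradient loss over the response tokens; the promised before/after table is what would show how much this stage actually contributes relative to SFT alone.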

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or equations present.

full rationale

The manuscript describes dataset construction (85M pretraining + 22M instruction), an efficient training framework, RL post-training, and benchmark results without any equations, mathematical derivations, or claimed first-principles reductions. Performance statements (e.g., outperforming Qwen2.5-VL variants on 18/27 or 27/27 benchmarks) are direct empirical comparisons, not outputs derived from fitted parameters or self-referential definitions. No load-bearing self-citations reduce to unverified prior claims within a derivation chain, as no such chain exists. The central claims rest on data curation and training details rather than any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified quality and balance of the newly curated datasets and the effectiveness of the offline packing strategy; no explicit free parameters, invented entities, or non-standard axioms are stated in the abstract.

axioms (2)
  • domain assumption Curated large-scale datasets of the stated sizes and balance produce higher-quality multimodal models than prior alternatives
    Implicit in the claim that the 85M and 22M datasets enable SOTA performance
  • domain assumption Standard supervised and RL training procedures on these data yield the reported benchmark gains
    Underlying the performance and RL post-training claims

pith-pipeline@v0.9.0 · 5685 in / 1530 out tokens · 44083 ms · 2026-05-12T10:48:08.389054+00:00 · methodology

discussion (0)

Forward citations

Cited by 35 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

    cs.CV 2026-04 unverdicted novelty 8.0

    PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.

  2. ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

    cs.CV 2026-05 unverdicted novelty 7.0

    ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

  3. GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.

  4. GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment

    cs.CV 2026-05 unverdicted novelty 7.0

    GameScope provides 4,048 multi-codec gaming videos with MOS ratings and attribute annotations, claimed as the first comprehensive dataset for gaming video quality assessment across codecs and content types.

  5. VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results acro...

  6. Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

    cs.CV 2026-04 unverdicted novelty 7.0

    DailyClue is a new benchmark that requires MLLMs to actively seek visual clues in authentic daily scenarios across four domains and 16 subtasks before performing reasoning.

  7. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  8. BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    BARD bridges autoregressive and diffusion VLMs with progressive block merging plus stage-wise intra-diffusion distillation, delivering 3x speedup and new SOTA on open dVLMs using under 4.4M data points.

  9. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  10. BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing

    cs.CV 2026-04 unverdicted novelty 7.0

    BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.

  11. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  12. ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

    cs.CV 2026-03 unverdicted novelty 7.0

    ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.

  13. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  14. Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.

  15. Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

    cs.CV 2026-05 unverdicted novelty 6.0

    A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accura...

  16. LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.

  17. Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper releases the Sens-VisualNews dataset of 9,576 annotated news images for sensational image detection and benchmarks open multimodal LLMs on zero-shot and fine-tuned performance.

  18. Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.

  19. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 6.0

    Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

  20. Causal Probing for Internal Visual Representations in Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    Activation steering reveals localized encoding for entities versus distributed encoding for abstract concepts in MLLMs, identifying depth as key for the latter and a perception-reasoning disconnect.

  21. SurgCheck: Do Vision-Language Models Really Look at Images in Surgical VQA?

    cs.CV 2026-05 unverdicted novelty 6.0

    SurgCheck benchmark reveals that vision-language models for surgical VQA often depend on linguistic shortcuts rather than visual reasoning, shown by consistent performance drops on less-biased questions.

  22. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  23. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 conditional novelty 6.0

    SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

  24. Boosting Visual Instruction Tuning with Self-Supervised Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.

  25. Precise Shield: Explaining and Aligning VLLM Safety via Neuron-Level Guidance

    cs.CV 2026-04 unverdicted novelty 6.0

    Precise Shield identifies safety neurons in VLLMs via activation contrasts and aligns only them with gradient masking, boosting safety, preserving generalization, and enabling zero-shot cross-lingual and cross-modal transfer.

  26. Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

  27. Small Vision-Language Models are Smart Compressors for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

  28. Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.

  29. Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

    cs.CV 2026-05 unverdicted novelty 5.0

    Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.

  30. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  31. Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

    cs.CV 2026-05 unverdicted novelty 5.0

    MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

  32. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  33. SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and...

  34. Steering the Verifiability of Multimodal AI Hallucinations

    cs.AI 2026-04 unverdicted novelty 5.0

    Researchers create a human-labeled dataset of obvious and elusive multimodal hallucinations and use learned activation-space probes to control their verifiability in MLLMs.

  35. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 32 Pith papers · 8 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923.

  2. [2]

    Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models

    Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, and S.-H. Gary Chan. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv:2406.16866, 2024a. Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the...

  3. [3]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394.

  4. [4]

    Seed1.5-VL Technical Report

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv:2505.07062.

  5. [5]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024a. Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024b. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. Yuliang Liu, Zhang ...

  6. [6]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.

  7. [7]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv:1909.08053.

  8. [8]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. ...

  9. [9]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InNeurIPS, 2025a. Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with math-vision dataset. InNeurIPS, 2024a. P...

  10. [10]

    Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence

    Jiaxing Zhang, Ruyi Gan, Junjie Wang, Yuxiang Zhang, Lin Zhang, Ping Yang, Xinyu Gao, Ziwei Wu, Xiaoqun Dong, Junqing He, Jianheng Zhuo, Qi Yang, Yongfeng Huang, Xiayu Li, Yanghan Wu, Junyu Lu, Xinyu Zhu, Weifeng Chen, Ting Han, Kunhao Pan, Rui Wang, Hao Wang, Xiaojun Wu, Zhongshen Zeng, and Chongpei Chen. Fengshenbang 1.0: Being the foundation of chinese...

  11. [11]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models. InACL, 2025a. Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang ...

  12. [12]

    LLaVA-OV-1.5 vs. Qwen2.5-VL with Same LLM (internal anchor)

    A LLaVA-OV-1.5 vs. Qwen2.5-VL with Same LLM. To enable a fair comparison with Qwen2.5-VL, we train LLaVA-Onevision-1.5-3B based on Qwen2.5-3B-Instruct. As shown in Fig. 9, LLaVA-Onevision-1.5-3B also demonstrates superior performance, achieving better results on 17 out of 27 downstream benchmarks. Figure 9: Comparison between LLaVA-OV-1.5-3B and Qwen2.5-VL-3B model ...