Ovis2.5 Technical Report
Pith reviewed 2026-05-15 20:26 UTC · model grok-4.3
The pith
Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, preserving both fine detail and global layout. It is also trained to perform reflection, including self-checking and revision beyond linear chain-of-thought, exposed at inference as an optional thinking mode. Training follows a five-phase curriculum from foundational pretraining to reasoning enhancement with DPO and GRPO. The 9B model averages 78.3 on OpenCompass, state-of-the-art among open-source MLLMs in the sub-40B range, and the 2B model scores 73.9, state-of-the-art for its size.
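To make the contrast with fixed-resolution tiling concrete, here is a minimal sketch of what variable-resolution preprocessing for a ViT typically looks like; the patch size, token budget, and function name are hypothetical illustrations, not values from the report.

```python
from PIL import Image

PATCH = 14          # hypothetical ViT patch size, not taken from the report
MAX_TOKENS = 4096   # hypothetical per-image visual-token budget

def preprocess_native(img: Image.Image):
    """Resize to the nearest patch-aligned size while preserving aspect ratio,
    instead of slicing the image into fixed-resolution tiles."""
    w, h = img.size
    # Downscale only if the full-resolution patch grid would exceed the budget.
    tokens = (w // PATCH) * (h // PATCH)
    if tokens > MAX_TOKENS:
        scale = (MAX_TOKENS / tokens) ** 0.5
        w, h = int(w * scale), int(h * scale)
    # Snap both sides to multiples of the patch size (at least one patch each).
    w = max(PATCH, (w // PATCH) * PATCH)
    h = max(PATCH, (h // PATCH) * PATCH)
    return img.resize((w, h)), (h // PATCH, w // PATCH)  # image, patch grid
```

The point of the sketch is that the patch grid varies per image, so dense layouts such as charts keep both small glyphs and overall structure, rather than being split into uniform tiles.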
What carries the argument
A native-resolution vision transformer paired with a reflection mechanism for reasoning
Load-bearing premise
The benchmark gains stem primarily from the native-resolution vision transformer and reflection training rather than from differences in data volume, quality, or hyperparameter choices.
What would settle it
A controlled comparison retraining an otherwise identical model with fixed-resolution tiling and linear chain-of-thought on the same data curriculum, then measuring whether its OpenCompass average falls below 78.3.
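A sketch of how such a controlled comparison could be scored, assuming access to an evaluation harness; the benchmark list, model identifiers, and score function below are hypothetical placeholders rather than released artifacts.

```python
from statistics import mean

# Hypothetical scorer: in a real study this would invoke the evaluation
# harness (e.g. VLMEvalKit) on a given model checkpoint and benchmark.
def score(model_id: str, benchmark: str) -> float:
    raise NotImplementedError("plug in the actual evaluation harness here")

# OpenCompass-style multimodal suite (illustrative benchmark list).
BENCHMARKS = ["MMBench", "MMStar", "MMMU", "MathVista",
              "OCRBench", "AI2D", "HallusionBench", "MMVet"]

def opencompass_average(model_id: str) -> float:
    """Average accuracy over the suite, mirroring the leaderboard metric."""
    return mean(score(model_id, b) for b in BENCHMARKS)

# Same curriculum, data, and optimization steps for both runs; only the
# vision input handling and the reasoning objective differ.
# full    = opencompass_average("ovis2.5-9b")                  # native-res + reflection
# ablated = opencompass_average("ovis2.5-9b-tiled-linear-cot") # fixed tiling + linear CoT
# print(f"gap attributable to the two mechanisms: {full - ablated:+.1f} points")
```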
Original abstract
We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
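The abstract credits part of the end-to-end speedup to multimodal data packing. A minimal sketch of the underlying idea, greedy first-fit-decreasing packing of variable-length samples into fixed-length training sequences, under assumptions of my own; the report's exact packing algorithm is not specified here.

```python
def pack_samples(lengths, max_len=8192):
    """Greedy first-fit-decreasing packing: group variable-length multimodal
    samples (text + image tokens) into bins of at most max_len tokens, so that
    padding waste per training sequence is minimized."""
    bins, remaining = [], []          # sample indices per bin, free space per bin
    for idx, n in sorted(enumerate(lengths), key=lambda x: -x[1]):
        if n > max_len:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds max_len")
        for b, free in enumerate(remaining):
            if n <= free:             # first bin with enough room
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:                         # no existing bin fits: open a new one
            bins.append([idx])
            remaining.append(max_len - n)
    return bins

# Six variable-length samples pack into two 8192-token training sequences.
print(pack_samples([5000, 3000, 2500, 4000, 1000, 500]))
```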
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Ovis2.5, a successor to Ovis2 featuring a native-resolution vision transformer for variable-resolution image processing and a reflection mechanism (self-checking and revision) for enhanced multimodal reasoning beyond linear chain-of-thought. The models are trained via a five-phase curriculum progressing from pretraining to instruction tuning and alignment with DPO/GRPO, with efficiency gains from multimodal data packing and hybrid parallelism. Ovis2.5-9B reports an average of 78.3 on the OpenCompass multimodal leaderboard (SOTA among open-source sub-40B models) and Ovis2.5-2B reports 73.9 (SOTA for its size), with leading results on STEM, grounding, video, and chart tasks.
Significance. If the reported gains are robustly attributable to the native-resolution ViT and reflection components, the work would advance open-source MLLM capabilities, particularly by demonstrating strong performance in small models suitable for on-device use and by providing practical scaling techniques via data packing and parallelism.
major comments (2)
- [Evaluation / Experiments] The evaluation section reports substantial gains over Ovis2-8B (78.3 vs. unspecified predecessor score) on OpenCompass without ablation studies that hold total training tokens, data mixture, and optimization steps fixed while isolating the native-resolution vision transformer and reflection objective; this leaves the attribution of improvements to the highlighted mechanisms unverified.
- [Method / Training Curriculum] The method section describes the five-phase curriculum at a high level but provides no quantitative details on per-phase token counts, data composition, or the precise formulation of the reflection objective and GRPO application, which are load-bearing for assessing both the claimed efficiency and the source of performance gains.
minor comments (2)
- [Abstract] The abstract states a 'substantial improvement over Ovis2-8B' but does not report the predecessor's exact OpenCompass score for direct quantitative context.
- [Inference / Results] Include at least one concrete example (with latency and accuracy numbers) of the optional 'thinking mode' to illustrate the latency-accuracy trade-off at inference.
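To illustrate what such a report could look like, here is a minimal measurement harness for the latency-accuracy trade-off; the answer interface is a hypothetical placeholder, since the model's thinking-mode API is not described in the material above.

```python
import time
from statistics import mean

def answer(question, image, thinking: bool) -> str:
    """Hypothetical inference call: the report exposes a 'thinking mode' toggle
    at inference time, but its exact API is not described here."""
    raise NotImplementedError("wire up the released checkpoint here")

def measure(dataset, thinking: bool):
    """Return (mean latency in seconds, accuracy) over (question, image, gold) triples."""
    latencies, correct = [], 0
    for question, image, gold in dataset:
        start = time.perf_counter()
        pred = answer(question, image, thinking=thinking)
        latencies.append(time.perf_counter() - start)
        correct += int(pred.strip() == gold.strip())
    return mean(latencies), correct / len(dataset)

# Reporting both settings side by side on a hard subset would document the trade-off:
# lat_fast,  acc_fast  = measure(hard_subset, thinking=False)
# lat_think, acc_think = measure(hard_subset, thinking=True)
```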
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address each major comment below, providing the strongest honest defense of the manuscript while committing to revisions where feasible.
Point-by-point responses
- Referee: [Evaluation / Experiments] The evaluation section reports substantial gains over Ovis2-8B (78.3 vs. unspecified predecessor score) on OpenCompass without ablation studies that hold total training tokens, data mixture, and optimization steps fixed while isolating the native-resolution vision transformer and reflection objective; this leaves the attribution of improvements to the highlighted mechanisms unverified.
  Authors: We agree that controlled ablations fixing total training tokens, data mixture, and optimization steps would provide stronger causal evidence for the contributions of the native-resolution ViT and reflection mechanism. Such experiments are computationally expensive at this scale and were not performed. The manuscript instead demonstrates overall system performance through direct comparison to the Ovis2-8B predecessor and other open-source MLLMs, with consistent gains on STEM, grounding, video, and chart tasks. We have now explicitly stated the Ovis2-8B OpenCompass score in the revised evaluation section for clarity. Revision: partial.
- Referee: [Method / Training Curriculum] The method section describes the five-phase curriculum at a high level but provides no quantitative details on per-phase token counts, data composition, or the precise formulation of the reflection objective and GRPO application, which are load-bearing for assessing both the claimed efficiency and the source of performance gains.
  Authors: We accept that the original description was high-level. In the revised manuscript we have expanded the training curriculum section to include per-phase token counts, data composition ratios, the mathematical formulation of the reflection objective (self-checking and revision steps), and the precise GRPO implementation details during the alignment phase. These additions improve reproducibility without altering the core claims. Revision: yes.
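For readers checking the rebuttal's reproducibility claim, the commonly used GRPO objective with group-normalized advantages is sketched below; whether Ovis2.5 uses exactly this formulation or a variant is not stated in the material above, so treat it as the standard form rather than the paper's.

```latex
% Commonly used GRPO objective with group-normalized advantages; a sketch of
% the standard form, not necessarily the exact variant used in Ovis2.5.
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)=
\mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_i,\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\big)
-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\Bigg]
\]
\[
\text{with } r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})},
\qquad
\hat{A}_i=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}.
\]
% The advantage of each sampled response is its reward standardized within the
% group of G responses to the same prompt, removing the need for a value model.
```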
Circularity Check
No circularity in empirical performance claims or architecture descriptions
Full rationale
The paper is a technical report describing an MLLM architecture, training curriculum, and benchmark results on external public leaderboards (OpenCompass). No mathematical derivations, equations, or first-principles predictions are present that could reduce to self-defined inputs. Performance numbers (78.3 and 73.9 averages) are measured outcomes against independent benchmarks rather than quantities fitted or renamed from the authors' own parameters. Architectural choices (native-resolution ViT, reflection mode) and the five-phase curriculum are presented as design decisions, not derived via self-citation chains or ansatzes that loop back. The report is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter counts (9B, 2B)
- five-phase curriculum hyperparameters
axioms (2)
- domain assumption: Public multimodal benchmarks accurately reflect real-world visual reasoning performance.
- domain assumption: Reflection improves accuracy on difficult inputs without introducing new failure modes.
Forward citations
Cited by 19 Pith papers
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
- Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
  Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
- Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
  Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
- EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
  EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
- TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
  TopBench is a new benchmark exposing that LLMs default to table lookups instead of intent-aware predictive reasoning on tabular data, with intent disambiguation as a key prerequisite.
- FCMBench-Video: Benchmarking Document Video Intelligence
  FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
  SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
- ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
  ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
  ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
  EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
- BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning
  BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  The MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
- AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
  Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
  An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.