Ovis2.5 Technical Report
Pith reviewed 2026-05-15 20:26 UTC · model grok-4.3
The pith
Ovis2.5 processes images at native resolutions and adds reflection to reach 78.3 on the OpenCompass multimodal leaderboard.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, preserving both fine detail and global layout. It is also trained to perform reflection, including self-checking and revision beyond linear chain-of-thought, exposed at inference as an optional thinking mode. Training follows a five-phase curriculum from foundational pretraining to reasoning enhancement with DPO and GRPO. The 9B model averages 78.3 on OpenCompass, state-of-the-art among open-source MLLMs in the sub-40B range, and the 2B model scores 73.9, state-of-the-art for its size.
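To make the contrast with fixed-resolution tiling concrete, here is a minimal sketch of what variable-resolution preprocessing for a ViT typically looks like; the patch size, token budget, and function name are hypothetical illustrations, not values from the report.

```python
from PIL import Image

PATCH = 14          # hypothetical ViT patch size, not taken from the report
MAX_TOKENS = 4096   # hypothetical per-image visual-token budget

def preprocess_native(img: Image.Image):
    """Resize to the nearest patch-aligned size while preserving aspect ratio,
    instead of slicing the image into fixed-resolution tiles."""
    w, h = img.size
    # Downscale only if the full-resolution patch grid would exceed the budget.
    tokens = (w // PATCH) * (h // PATCH)
    if tokens > MAX_TOKENS:
        scale = (MAX_TOKENS / tokens) ** 0.5
        w, h = int(w * scale), int(h * scale)
    # Snap both sides to multiples of the patch size (at least one patch each).
    w = max(PATCH, (w // PATCH) * PATCH)
    h = max(PATCH, (h // PATCH) * PATCH)
    return img.resize((w, h)), (h // PATCH, w // PATCH)  # image, patch grid
```

The point of the sketch is that the patch grid varies per image, so dense layouts such as charts keep both small glyphs and overall structure, rather than being split into uniform tiles.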
What carries the argument
A native-resolution vision transformer paired with a reflection mechanism for reasoning
Load-bearing premise
The benchmark gains stem primarily from the native-resolution vision transformer and reflection training rather than from differences in data volume, quality, or hyperparameter choices.
What would settle it
A controlled comparison retraining an otherwise identical model with fixed-resolution tiling and linear chain-of-thought on the same data curriculum, then measuring whether its OpenCompass average falls below 78.3.
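A sketch of how such a controlled comparison could be scored, assuming access to an evaluation harness; the benchmark list, model identifiers, and score function below are hypothetical placeholders rather than released artifacts.

```python
from statistics import mean

# Hypothetical scorer: in a real study this would invoke the evaluation
# harness (e.g. VLMEvalKit) on a given model checkpoint and benchmark.
def score(model_id: str, benchmark: str) -> float:
    raise NotImplementedError("plug in the actual evaluation harness here")

# OpenCompass-style multimodal suite (illustrative benchmark list).
BENCHMARKS = ["MMBench", "MMStar", "MMMU", "MathVista",
              "OCRBench", "AI2D", "HallusionBench", "MMVet"]

def opencompass_average(model_id: str) -> float:
    """Average accuracy over the suite, mirroring the leaderboard metric."""
    return mean(score(model_id, b) for b in BENCHMARKS)

# Same curriculum, data, and optimization steps for both runs; only the
# vision input handling and the reasoning objective differ.
# full    = opencompass_average("ovis2.5-9b")                  # native-res + reflection
# ablated = opencompass_average("ovis2.5-9b-tiled-linear-cot") # fixed tiling + linear CoT
# print(f"gap attributable to the two mechanisms: {full - ablated:+.1f} points")
```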
Original abstract
We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
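The abstract credits part of the end-to-end speedup to multimodal data packing. A minimal sketch of the underlying idea, greedy first-fit-decreasing packing of variable-length samples into fixed-length training sequences, under assumptions of my own; the report's exact packing algorithm is not specified here.

```python
def pack_samples(lengths, max_len=8192):
    """Greedy first-fit-decreasing packing: group variable-length multimodal
    samples (text + image tokens) into bins of at most max_len tokens, so that
    padding waste per training sequence is minimized."""
    bins, remaining = [], []          # sample indices per bin, free space per bin
    for idx, n in sorted(enumerate(lengths), key=lambda x: -x[1]):
        if n > max_len:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds max_len")
        for b, free in enumerate(remaining):
            if n <= free:             # first bin with enough room
                bins[b].append(idx)
                remaining[b] -= n
                break
        else:                         # no existing bin fits: open a new one
            bins.append([idx])
            remaining.append(max_len - n)
    return bins

# Six variable-length samples pack into two 8192-token training sequences.
print(pack_samples([5000, 3000, 2500, 4000, 1000, 500]))
```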
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Ovis2.5, a successor to Ovis2 featuring a native-resolution vision transformer for variable-resolution image processing and a reflection mechanism (self-checking and revision) for enhanced multimodal reasoning beyond linear chain-of-thought. The models are trained via a five-phase curriculum progressing from pretraining to instruction tuning and alignment with DPO/GRPO, with efficiency gains from multimodal data packing and hybrid parallelism. Ovis2.5-9B reports an average of 78.3 on the OpenCompass multimodal leaderboard (SOTA among open-source sub-40B models) and Ovis2.5-2B reports 73.9 (SOTA for its size), with leading results on STEM, grounding, video, and chart tasks.
Significance. If the reported gains are robustly attributable to the native-resolution ViT and reflection components, the work would advance open-source MLLM capabilities, particularly by demonstrating strong performance in small models suitable for on-device use and by providing practical scaling techniques via data packing and parallelism.
major comments (2)
- [Evaluation / Experiments] The evaluation section reports substantial gains over Ovis2-8B (78.3 vs. unspecified predecessor score) on OpenCompass without ablation studies that hold total training tokens, data mixture, and optimization steps fixed while isolating the native-resolution vision transformer and reflection objective; this leaves the attribution of improvements to the highlighted mechanisms unverified.
- [Method / Training Curriculum] The method section describes the five-phase curriculum at a high level but provides no quantitative details on per-phase token counts, data composition, or the precise formulation of the reflection objective and GRPO application, which are load-bearing for assessing both the claimed efficiency and the source of performance gains.
minor comments (2)
- [Abstract] The abstract states a 'substantial improvement over Ovis2-8B' but does not report the predecessor's exact OpenCompass score for direct quantitative context.
- [Inference / Results] Include at least one concrete example (with latency and accuracy numbers) of the optional 'thinking mode' to illustrate the latency-accuracy trade-off at inference.
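To illustrate what such a report could look like, here is a minimal measurement harness for the latency-accuracy trade-off; the answer interface is a hypothetical placeholder, since the model's thinking-mode API is not described in the material above.

```python
import time
from statistics import mean

def answer(question, image, thinking: bool) -> str:
    """Hypothetical inference call: the report exposes a 'thinking mode' toggle
    at inference time, but its exact API is not described here."""
    raise NotImplementedError("wire up the released checkpoint here")

def measure(dataset, thinking: bool):
    """Return (mean latency in seconds, accuracy) over (question, image, gold) triples."""
    latencies, correct = [], 0
    for question, image, gold in dataset:
        start = time.perf_counter()
        pred = answer(question, image, thinking=thinking)
        latencies.append(time.perf_counter() - start)
        correct += int(pred.strip() == gold.strip())
    return mean(latencies), correct / len(dataset)

# Reporting both settings side by side on a hard subset would document the trade-off:
# lat_fast,  acc_fast  = measure(hard_subset, thinking=False)
# lat_think, acc_think = measure(hard_subset, thinking=True)
```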
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address each major comment below, providing the strongest honest defense of the manuscript while committing to revisions where feasible.
Point-by-point responses
- Referee: [Evaluation / Experiments] The evaluation section reports substantial gains over Ovis2-8B (78.3 vs. unspecified predecessor score) on OpenCompass without ablation studies that hold total training tokens, data mixture, and optimization steps fixed while isolating the native-resolution vision transformer and reflection objective; this leaves the attribution of improvements to the highlighted mechanisms unverified.
  Authors: We agree that controlled ablations fixing total training tokens, data mixture, and optimization steps would provide stronger causal evidence for the contributions of the native-resolution ViT and reflection mechanism. Such experiments are computationally expensive at this scale and were not performed. The manuscript instead demonstrates overall system performance through direct comparison to the Ovis2-8B predecessor and other open-source MLLMs, with consistent gains on STEM, grounding, video, and chart tasks. We have now explicitly stated the Ovis2-8B OpenCompass score in the revised evaluation section for clarity. Revision: partial.
- Referee: [Method / Training Curriculum] The method section describes the five-phase curriculum at a high level but provides no quantitative details on per-phase token counts, data composition, or the precise formulation of the reflection objective and GRPO application, which are load-bearing for assessing both the claimed efficiency and the source of performance gains.
  Authors: We accept that the original description was high-level. In the revised manuscript we have expanded the training curriculum section to include per-phase token counts, data composition ratios, the mathematical formulation of the reflection objective (self-checking and revision steps), and the precise GRPO implementation details during the alignment phase. These additions improve reproducibility without altering the core claims. Revision: yes.
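For readers checking the rebuttal's reproducibility claim, the commonly used GRPO objective with group-normalized advantages is sketched below; whether Ovis2.5 uses exactly this formulation or a variant is not stated in the material above, so treat it as the standard form rather than the paper's.

```latex
% Commonly used GRPO objective with group-normalized advantages; a sketch of
% the standard form, not necessarily the exact variant used in Ovis2.5.
\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)=
\mathbb{E}_{q,\;\{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_i,\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\big)
-\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\Bigg]
\]
\[
\text{with } r_{i,t}(\theta)=\frac{\pi_\theta(o_{i,t}\mid q,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})},
\qquad
\hat{A}_i=\frac{R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}.
\]
% The advantage of each sampled response is its reward standardized within the
% group of G responses to the same prompt, removing the need for a value model.
```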
Circularity Check
No circularity in empirical performance claims or architecture descriptions
Full rationale
The paper is a technical report describing an MLLM architecture, training curriculum, and benchmark results on external public leaderboards (OpenCompass). No mathematical derivations, equations, or first-principles predictions are present that could reduce to self-defined inputs. Performance numbers (78.3 and 73.9 averages) are measured outcomes against independent benchmarks rather than quantities fitted or renamed from the authors' own parameters. Architectural choices (native-resolution ViT, reflection mode) and the five-phase curriculum are presented as design decisions, not derived via self-citation chains or ansatzes that loop back. The report is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter counts (9B, 2B)
- five-phase curriculum hyperparameters
axioms (2)
- domain assumption: Public multimodal benchmarks accurately reflect real-world visual reasoning performance.
- domain assumption: Reflection improves accuracy on difficult inputs without introducing new failure modes.
Forward citations
Cited by 19 Pith papers
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- From Table to Cell: Attention for Better Reasoning with TABALIGN
  TABALIGN pairs a diffusion language model planner emitting binary cell masks with a trained attention verifier, raising average accuracy 15.76 points over strong baselines on eight table benchmarks while speeding exec...
- Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
  Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
- Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
  Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
- EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
  EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
- TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering
  TopBench is a new benchmark exposing that LLMs default to table lookups instead of intent-aware predictive reasoning on tabular data, with intent disambiguation as a key prerequisite.
- FCMBench-Video: Benchmarking Document Video Intelligence
  FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
- MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale
  A fixed 1.2B model trained via diversity-aware sampling, cross-model verification, annotation refinement, and progressive stages achieves new state-of-the-art document parsing accuracy of 95.69 on OmniDocBench v1.6.
- SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
  SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
- ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning
  ReasonEdit uses a new CoT dataset and reinforcement learning to produce interpretable, human-aligned evaluations of text-guided image edits.
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
  MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
  ITIScore evaluates MLLM image captions via image-to-text-to-image reconstruction consistency and aligns with human judgments on a new 40K-caption benchmark.
- EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs
  EgoMind activates spatial cognition in MLLMs via linguistic Role-Play Caption and Progressive Spatial Analysis, reaching competitive results on VSI-Bench, SPAR-Bench, SITE-Bench and SPBench with only 5K SFT and 20K RL...
- BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning
  BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.
- MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
  The MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
- AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
  Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
  An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.