DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
super hub Mixed citations
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
Mixed citation behavior. Most common role is background (56%).
abstract
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see https://huggingface.co/spaces/OpenGVLab/InternVL
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult
authors
co-cited works
representative citing papers
VLMs show chance-level depth ordering performance (47-56%) on controlled images, driven by language bias rather than pictorial cues, with no improvement from CoT or ICL.
EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
Introduces the FeVOS task, a 968-clip dataset with foresight expressions, and an MLLM model FeVOS-R1 trained via SFT then RL that reports SOTA on the new task plus generalization to prior RVOS benchmarks.
Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.
Remote sensing MLLMs perform poorly on negation tasks with hallucinations and accuracy drops, but the NeFo test-time learning method substantially improves negation understanding and generalizes to unseen tasks using ~5% unlabeled test samples.
Earth-OneVision is a unified 2B-parameter RS-MLLM supporting six modalities and nine tasks via FGVLA, SLIS, and PCMA mechanisms plus a 34M QA-pair dataset, reporting competitive or superior benchmark results versus larger models.
TVI-CoT introduces learnable control tokens <THINK>, <LOOK>, <ANSWER> that let multimodal LLMs interleave textual reasoning with dynamic visual feature access, reporting gains of 3.4-6.1% on eight benchmarks over prior CoT baselines.
Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.
SVHighlights is the first benchmark for highlight detection in hour-long sports videos, with TF-SELECTOR showing that segment-level LLM scoring outperforms adapted short-video baselines by 2.5-4 points on key metrics.
DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.
Authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general capabilities.
Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.
ERGeoBench is a new diagnostic benchmark evaluating MLLMs on four capabilities in three progressive embodied geo-localization settings, finding that models handle high-level semantics but struggle with fine-grained perception and metric localization.
Introduces SANSA paradigm for semantic-agnostic vision-language segmentation via dictionary or example-based prompts, with finetuning delivering up to 20% mIoU gains on the new task while retaining standard performance.
Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.
OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.
CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.
citing papers explorer
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
-
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
-
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
-
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
-
From Priors to Perception: Grounding Video-LLMs in Physical Reality
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
-
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
-
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
-
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-language models.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Emerging Properties in Unified Multimodal Pretraining
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
-
JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding and generation via an MLLM-MMDiT architecture with spatial training signals to reach competitive benchmark performance and stronger spatial intelligence.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.