Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Erfei Cui, Weiyun Wang, Yangzhou Liu, Yue Cao, Zhangwei Gao, Zhe Chen · 2024 · cs.CV · arXiv 2412.05271

Mixed citation behavior. Most common role is background (55%).

183 Pith papers citing it

abstract

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For a HuggingFace demo, see https://huggingface.co/spaces/OpenGVLab/InternVL

citation-role summary

background 29 · baseline 18 · method 4 · dataset 1 · other 1
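
The 55% background share can be reproduced from the role counts listed above, assuming the percentage is taken over the 53 classified citations (the sum of the five counts) rather than over all 183 citing papers. A minimal Python sketch; the dictionary and function names are illustrative and not part of any Pith tool:

```python
# Role counts copied from the citation-role summary.
role_counts = {"background": 29, "baseline": 18, "method": 4, "dataset": 1, "other": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percent."""
    total = sum(counts.values())  # classified citations only (assumption)
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
top_role = max(role_counts, key=role_counts.get)
print(f"{top_role}: {shares[top_role]}% of {sum(role_counts.values())} classified citations")
```

Under this assumption, the remaining 130 citing papers are simply unclassified and do not enter the denominator.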

citation-polarity summary

background 29 · baseline 18 · use method 5 · unclear 1

representative citing papers

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes the Spatial Narrative Score (SNS) for evaluating VLMs' camera motion understanding and introduces the CaMo model, which achieves consistent performance on SNS and direct QA.

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

cs.CV · 2026-05-13 · conditional · novelty 7.0

SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annotated frames.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Introduces multi-grained counting with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and the HieraCount model, which combines text and visual exemplars for improved accuracy.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.

Act2See: Emergent Active Visual Perception for Video Reasoning

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

Membership Inference Attacks Against Video Large Language Models

cs.CR · 2026-04-29 · unverdicted · novelty 7.0

A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

Vision-language models achieve at most 61.9% accuracy on identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.

S-GRPO: Unified Post-Training for Large Vision-Language Models

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding

cs.MA · 2026-04-09 · unverdicted · novelty 7.0

Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.

citing papers explorer

Showing 3 of 3 citing papers after filters.

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

cs.DC · 2026-05-18 · unverdicted · none · ref 8 · internal anchor

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.

Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing

cs.DC · 2026-05-18 · unverdicted · none · ref 7 · internal anchor

Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.

DFLOP: A Data-driven Framework for Multimodal LLM Training Pipeline Optimization

cs.DC · 2026-03-26 · unverdicted · none · ref 8 · internal anchor

DFLOP is a data-driven framework that profiles data-induced computation variance and uses predictive scheduling to balance workloads in multimodal LLM training pipelines, claiming up to 3.6x faster training than existing frameworks.
