InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
142 Pith papers cite this work.
abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
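The abstract's mention of variable visual position encoding (V2PE) refers to letting visual tokens advance the position index by a fractional stride rather than 1, so long interleaved image-text sequences occupy a shorter positional span. The sketch below is a minimal illustration of that idea under stated assumptions, not InternVL3's implementation; the function name `v2pe_position_ids`, the default `delta` value, and the boolean visual-token mask are illustrative.

```python
# Minimal sketch of the fractional-stride idea behind Variable Visual Position
# Encoding (V2PE): text tokens advance the position index by 1, visual tokens
# by a smaller stride `delta`, compressing the positional span of long
# multimodal sequences. Names and the delta value are illustrative assumptions.

from typing import List

def v2pe_position_ids(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign (possibly fractional) position indices to a mixed token sequence.

    is_visual[i] is True if token i is a visual (image-patch) token.
    delta is the per-visual-token positional stride (0 < delta <= 1).
    """
    positions, current = [], 0.0
    for visual in is_visual:
        positions.append(current)
        # Visual tokens consume less of the positional "budget" than text tokens.
        current += delta if visual else 1.0
    return positions

if __name__ == "__main__":
    # Example: 3 text tokens, 4 visual tokens, 2 text tokens.
    mask = [False] * 3 + [True] * 4 + [False] * 2
    print(v2pe_position_ids(mask))
    # -> [0.0, 1.0, 2.0, 3.0, 3.25, 3.5, 3.75, 4.0, 5.0]
    # With a stride of 1.0 the sequence would span positions 0..8; the fractional
    # stride shrinks that span, which is how this style of encoding supports
    # longer multimodal contexts within a fixed position range.
```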
representative citing papers
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
VLMs over-correct when transcribing multi-line handwritten math via OCR, and the PINK metric, which uses LLM rubric grading, penalizes this over-correction for better alignment with human judgments.
G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset (generated via 3D synthesis and editing) and the HieraCount model, which combines text and visual exemplars for improved accuracy.
BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
State-conditioned commitment depth in a vision-language policy Pareto-dominates fixed-depth baselines on Sliding Puzzle and Sokoban, raising solve rates by up to 12.5 points while using 25% fewer actions and beating larger models.
STEMO-Bench evaluates intermediate spatio-temporal reasoning in video MLLMs via object-centric facts, and STEMO-Track improves consistency by chunk-wise trajectory construction and aggregation.
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.
GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks with a new 75K-pair PolarVQA benchmark.
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
SRPO refines GRPO into role-aware token-level advantages by emphasizing perception tokens based on visual dependency (original vs. corrupted inputs) and reasoning tokens based on consistency with perception, unified via a shared baseline.
Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.
VT-Bench is the first unified benchmark aggregating 14 visual-tabular datasets with over 756K samples and evaluating 23 models to expose challenges in this multi-modal area.
GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench.
QCalEval is a new benchmark for quantum calibration plots on which the best zero-shot VLM scores 72.3, with the effects of fine-tuning and in-context learning varying by model type.
FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
citing papers explorer
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.