Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
97 papers cite this work.
Abstract
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. A HuggingFace demo is available at https://huggingface.co/spaces/OpenGVLab/InternVL.
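The demo link above is the quickest way to try the model; for programmatic use, the sketch below shows a minimal text-only query against an InternVL 2.5 checkpoint through Hugging Face transformers. It follows the pattern on the OpenGVLab model cards (a remote-code AutoModel exposing a chat() method); the checkpoint name, generation settings, and prompt are illustrative, and image inputs additionally require the tiling helpers from the model card, which are omitted here.

```python
# Minimal text-only query against InternVL 2.5 via Hugging Face transformers.
# Assumes the OpenGVLab/InternVL2_5-8B checkpoint, whose remote code provides
# the chat() method; image inputs need the model card's preprocessing helpers.
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2_5-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # chat() ships with the checkpoint, not transformers
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

generation_config = dict(max_new_tokens=256, do_sample=False)

# Passing pixel_values=None runs a pure-text conversation turn.
question = "Explain test-time scaling in one sentence."
response = model.chat(tokenizer, None, question, generation_config)
print(response)
```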
Citing papers
- VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
  VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
- Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
  SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annotated frames.
- Count Anything at Any Granularity
  Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset, generated via 3D synthesis and editing, and the HieraCount model, which combines text and visual exemplars for improved accuracy.
- AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
  AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
- V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
  V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
- Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
  ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
- VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
  VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering accuracy gains of up to 12 points and new SOTA results, both training-free and with SFT and RL.
- GEASS: Training-Free Caption Steering for Hallucination Mitigation in Vision-Language Models
  GEASS selectively gates and weights self-generated captions using confidence and entropy to reduce object hallucinations in VLMs, outperforming vanilla inference and contrastive decoding on POPE and HallusionBench (a generic sketch of this gating appears after this list).
- Act2See: Emergent Active Visual Perception for Video Reasoning
  Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
- Membership Inference Attacks Against Video Large Language Models
  A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples (a generic sketch of this signal appears after this list).
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
  The SpaMEM benchmark shows that multimodal LLMs succeed at spatial tasks with text histories but fail sharply at long-horizon belief maintenance from raw visual streams alone.
- OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
  OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.
- DistortBench: Benchmarking Vision Language Models on Image Distortion Identification
  Vision-language models achieve at most 61.9% accuracy in identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.
- DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models
  DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
- S-GRPO: Unified Post-Training for Large Vision-Language Models
  S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
- VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
  VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
  MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
- AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
  AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding
  This work introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that outperforms vanilla multimodal models.
- ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference
  ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining (a generic sketch of this selection rule appears after this list).
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
  Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
- V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
  V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
  SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
  DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
- SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models
  SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
  LatentRouter routes image-question queries to the best MLLM by predicting counterfactual performance via latent communication between learned query capsules and model capability tokens.
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
  SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
- Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
  VL-LCM measures vision-language logical consistency without annotations and shows that recent MLLMs have high accuracy but low logical consistency on benchmarks like MMMU and NaturalBench.
- From Priors to Perception: Grounding Video-LLMs in Physical Reality
  Video-LLMs fail at physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
- ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
  ScrapMem introduces optical forgetting to compress multimodal memories for LLM agents on edge devices, cutting storage by up to 93% while reaching 51.0% Joint@10 and 70.3% Recall@10 on ATM-Bench.
- Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
  CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.
- CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
  CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-correction in speculative decoding.
- Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
  Chart-FR1 uses Focus-CoT to link reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards, outperforming prior MLLMs on dense chart reasoning tasks.
- Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
  A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection rates from over 95% to under 5.47% while preserving benign performance.
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
  A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
- Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
  IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
- Long-Horizon Manipulation via Trace-Conditioned VLA Planning
  LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- Exploring High-Order Self-Similarity for Video Understanding
  The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization
  V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models
  ScanVLA uses a vision-language model with a history-enhanced decoder and frozen segmentation LoRA to outperform prior methods on object-referring scanpath prediction.
- Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
  DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.
- Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
  A parameter-free sampling strategy called CUTS, combined with Mixed-CUTS training, prevents mode collapse in RL on saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
- MetaEarth3D: Unlocking World-scale 3D Generation with Spatially Scalable Generative Modeling
  MetaEarth3D is the first generative foundation model for spatially consistent, unbounded 3D scene generation at planetary scale using optical Earth observation data.
- PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
  PivotMerge merges heterogeneous multimodal pre-trained models via shared-space decomposition to filter conflicts and layer-wise weights based on alignment contributions, outperforming baselines on multimodal benchmarks.
- Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization
  Vision-language models display large performance differences and clear limits in zero-shot country-level geolocalization from ground-view photos, with semantic cues helping coarse guesses but failing on fine details.
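The GEASS entry in the list above describes its mechanism concretely enough to sketch. Below is a minimal, hypothetical illustration of gating self-generated captions by confidence and entropy; the candidate format, thresholds, and confidence weighting are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of confidence/entropy gating for self-generated captions,
# in the spirit of the GEASS entry above. Thresholds and the weighting rule are
# illustrative assumptions, not the paper's method.
import math

def gate_captions(candidates, conf_min=0.5, ent_max=2.0):
    """Keep and weight captions that the model generated confidently.

    candidates: list of (text, token_logprobs, token_entropies) where
      token_logprobs  are log p of each sampled token (confidence signal),
      token_entropies are entropies of the full next-token distributions.
    Returns a list of (text, weight) for captions that pass both gates.
    """
    kept = []
    for text, logps, ents in candidates:
        confidence = math.exp(sum(logps) / len(logps))  # geometric-mean token prob
        avg_entropy = sum(ents) / len(ents)
        if confidence >= conf_min and avg_entropy <= ent_max:
            kept.append((text, confidence))             # weight by confidence
    return kept

# Example: a confident, low-entropy caption passes; an uncertain one is gated out.
caps = [
    ("a dog on a couch", [-0.1, -0.2, -0.1, -0.15, -0.1], [0.4, 0.6, 0.5, 0.7, 0.5]),
    ("a cat wearing a hat", [-1.5, -2.0, -1.8, -1.2, -1.6], [3.1, 2.8, 3.4, 2.9, 3.0]),
]
print(gate_captions(caps))
```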
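The membership-inference entry above also admits a generic illustration: perturb the decoding temperature and measure how much the model's likelihood of a sample changes, since members, which tend to sit on more peaked logits, respond differently to sharpening than non-members. The avg_logprob interface, temperatures, and sign of the signal below are illustrative assumptions, not the paper's actual attack.

```python
# Hypothetical sketch of a temperature-perturbation membership signal, in the
# spirit of the membership-inference entry above. The interface, temperatures,
# and scoring rule are illustrative assumptions, not the paper's attack.
def membership_score(avg_logprob, sample, t_low=0.5, t_high=1.5):
    """avg_logprob(sample, temperature) should return the model's mean token
    log-probability for `sample` with logits divided by `temperature`.

    Intuition: member samples tend to sit on peaked logits, so sharpening
    (low temperature) boosts their likelihood more than for non-members.
    """
    return avg_logprob(sample, t_low) - avg_logprob(sample, t_high)

# Flag `sample` as a training member when the score exceeds a threshold
# calibrated on known non-member data.
```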
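The ID-Selection entry above maps onto a familiar pattern: greedily keep the most important visual tokens while suppressing the scores of tokens similar to those already kept, in the style of maximal marginal relevance. The sketch below is a generic illustration with assumed inputs; the paper's actual importance scores, similarity measure, and schedule are not reproduced here.

```python
# Hypothetical sketch of importance-plus-diversity visual token selection, in
# the spirit of the ID-Selection entry above. The importance scores, cosine
# similarity, and suppression weight are stand-ins, not the paper's algorithm.
import torch
import torch.nn.functional as F

def select_tokens(features: torch.Tensor, importance: torch.Tensor,
                  keep: int, alpha: float = 0.5) -> torch.Tensor:
    """Pick `keep` token indices, trading importance against redundancy.

    features:   (N, D) visual token embeddings
    importance: (N,) relevance scores, e.g. attention received from the query
    """
    feats = F.normalize(features, dim=-1)
    scores = importance.clone().float()
    selected = []
    for _ in range(keep):
        idx = int(torch.argmax(scores))
        selected.append(idx)
        scores[idx] = float("-inf")            # never re-pick a kept token
        sim = feats @ feats[idx]               # cosine similarity to the pick
        scores -= alpha * sim.clamp(min=0)     # suppress near-duplicates
    return torch.tensor(selected)

tokens = torch.randn(576, 1024)                # e.g. a 24x24 ViT patch grid
imp = torch.rand(576)
kept = select_tokens(tokens, imp, keep=16)     # keep ~3% of visual tokens
print(kept)
```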