super hub Mixed citations

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Erfei Cui, Weiyun Wang, Yangzhou Liu, Yue Cao, Zhangwei Gao, Zhe Chen · 2024 · cs.CV · arXiv 2412.05271

Mixed citation behavior. Most common role is background (55%).

183 Pith papers citing it

Background 55% of classified citations

open full Pith review browse 183 citing papers more from Erfei Cui arXiv PDF

abstract

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLMs to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo see https://huggingface.co/spaces/OpenGVLab/InternVL

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 29 baseline 18 method 4 dataset 1 other 1

citation-polarity summary

background 29 baseline 18 use method 5 unclear 1

claims ledger

abstract We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, mult

authors

Erfei Cui Weiyun Wang Yangzhou Liu Yue Cao Zhangwei Gao Zhe Chen

co-cited works

representative citing papers

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.

SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.

Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMs

cs.CV · 2026-05-13 · conditional · novelty 7.0

SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annotated frames.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

cs.CV · 2026-05-11 · conditional · novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.

Act2See: Emergent Active Visual Perception for Video Reasoning

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.

Membership Inference Attacks Against Video Large Language Models

cs.CR · 2026-04-29 · unverdicted · novelty 7.0

A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

DistortBench: Benchmarking Vision Language Models on Image Distortion Identification

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

Vision-language models achieve at most 61.9% accuracy on identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.

DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.

S-GRPO: Unified Post-Training for Large Vision-Language Models

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

Open-Ended Video Game Glitch Detection with Agentic Reasoning and Temporal Grounding

cs.MA · 2026-04-09 · unverdicted · novelty 7.0

Introduces the first benchmark for open-ended video game glitch detection with temporal localization and proposes GliDe, an agentic framework that achieves stronger performance than vanilla multimodal models.

citing papers explorer

Showing 50 of 134 citing papers after filters.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering cs.CV · 2026-05-22 · unverdicted · none · ref 6 · internal anchor
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals cs.CV · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata cs.CV · 2026-05-20 · conditional · none · ref 136 · internal anchor
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation cs.CV · 2026-05-19 · conditional · none · ref 3 · internal anchor
HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 6 · internal anchor
Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation cs.CV · 2026-05-16 · unverdicted · none · ref 6 · internal anchor
HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.
Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMs cs.CV · 2026-05-13 · conditional · none · ref 6 · internal anchor
SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annotated frames.
Count Anything at Any Granularity cs.CV · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation cs.CV · 2026-05-11 · conditional · none · ref 50 · internal anchor
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning cs.CV · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs cs.CV · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.
VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA cs.CV · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
Act2See: Emergent Active Visual Perception for Video Reasoning cs.CV · 2026-05-03 · unverdicted · none · ref 6 · internal anchor
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 14 · internal anchor
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
DistortBench: Benchmarking Vision Language Models on Image Distortion Identification cs.CV · 2026-04-21 · unverdicted · none · ref 3 · internal anchor
Vision-language models achieve at most 61.9% accuracy on identifying image distortion types and severities, falling short of human majority-vote performance at 65.7%.
DO-Bench: An Attributable Benchmark for Diagnosing Object Hallucination in Vision-Language Models cs.CV · 2026-04-18 · unverdicted · none · ref 5 · internal anchor
DO-Bench is a controlled benchmark that attributes VLM object hallucination errors to textual prior pressure, perceptual limits, or their interaction via two diagnostic dimensions and metrics.
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models cs.CV · 2026-04-16 · unverdicted · none · ref 8 · internal anchor
VisPCO uses continuous relaxation, straight-through estimators, and budget-aware Pareto-frontier learning to automatically discover optimal visual token pruning configurations that approximate grid-search results across VLMs and benchmarks.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 10 · internal anchor
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 8 · internal anchor
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
ID-Selection: Importance-Diversity Based Visual Token Selection for Efficient LVLM Inference cs.CV · 2026-04-07 · unverdicted · none · ref 8 · internal anchor
ID-Selection combines importance scoring with iterative diversity suppression to prune 97.2% of visual tokens in LVLMs while retaining 91.8% performance and cutting FLOPs by over 97% without retraining.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models cs.CV · 2026-04-04 · unverdicted · none · ref 3 · internal anchor
Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators cs.CV · 2026-03-31 · unverdicted · none · ref 3 · internal anchor
V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the fine-grained perception gap on benchmarks.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CV · 2026-03-28 · unverdicted · none · ref 5 · internal anchor
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 6 · internal anchor
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV · 2026-01-30 · unverdicted · none · ref 6 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects cs.CV · 2026-01-21 · unverdicted · none · ref 7 · internal anchor
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation cs.CV · 2025-12-18 · unverdicted · none · ref 72 · internal anchor
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 2 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models cs.CV · 2025-06-10 · unverdicted · none · ref 10 · internal anchor
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? cs.CV · 2025-05-27 · conditional · none · ref 45 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 15 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning cs.CV · 2025-05-20 · unverdicted · none · ref 6 · internal anchor
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR cs.CV · 2025-04-15 · conditional · none · ref 8 · internal anchor
Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning cs.CV · 2025-04-02 · unverdicted · none · ref 3 · internal anchor
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning cs.CV · 2025-03-10 · unverdicted · none · ref 9 · internal anchor
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 5 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 23 · internal anchor
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation cs.CV · 2026-05-19 · unverdicted · none · ref 10 · internal anchor
FullFlow adds LoRA adapters and discrete text insertion to pretrained rectified-flow text-to-image models, achieving bidirectional generation with major gains in FID, CIDEr, VRAM, and throughput over Dual Diffusion baselines.
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA cs.CV · 2026-05-19 · unverdicted · none · ref 17 · internal anchor
JUDO enhances large multimodal models for industrial anomaly QA by juxtaposing query images with normal ones for visual comparison and using SFT plus GRPO with tailored rewards to inject domain knowledge, outperforming Qwen2.5-VL-7B and GPT-4o on the MMAD benchmark.
WOW-Seg: A Word-free Open World Segmentation Model cs.CV · 2026-05-16 · conditional · none · ref 3 · internal anchor
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers cs.CV · 2026-05-16 · unverdicted · none · ref 14 · internal anchor
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding cs.CV · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
SceneGraphVLM: Dynamic Scene Graph Generation from Video with Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 4 · internal anchor
SceneGraphVLM generates dynamic scene graphs from video using compact VLMs, TOON serialization, and hallucination-aware RL to improve precision and achieve one-second latency.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 7 · internal anchor
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
From Priors to Perception: Grounding Video-LLMs in Physical Reality cs.CV · 2026-05-06 · unverdicted · none · ref 9 · internal anchor
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts cs.CV · 2026-05-03 · unverdicted · none · ref 5 · internal anchor
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation cs.CV · 2026-04-29 · unverdicted · none · ref 21 · internal anchor
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs cs.CV · 2026-04-23 · unverdicted · none · ref 8 · internal anchor
IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer