Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Erfei Cui, Weiyun Wang, Yangzhou Liu, Yue Cao, Zhangwei Gao, Zhe Chen · 2024 · cs.CV · arXiv 2412.05271

Mixed citation behavior. Most common role is background (56%).

256 Pith papers citing it

Background 56% of classified citations

abstract

We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For the HuggingFace demo, see https://huggingface.co/spaces/OpenGVLab/InternVL

citation-role summary

background 30 · baseline 18 · method 4 · dataset 1 · other 1

citation-polarity summary

background 30 · baseline 18 · use method 5 · unclear 1

authors

Erfei Cui, Weiyun Wang, Yangzhou Liu, Yue Cao, Zhangwei Gao, Zhe Chen

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

Disentangling Pictorial Cue Understanding from Language Bias in VLMs via Depth Ordering Task

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

VLMs show chance-level depth ordering performance (47-56%) on controlled images, driven by language bias rather than pictorial cues, with no improvement from CoT or ICL.

EgoGapBench: Benchmarking Egocentric Action Selection in Multi-Agent Scenes

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

EgoGapBench shows humans reliably select egocentric actions in multi-agent scenes while MLLMs systematically choose other agents' actions, and standard egocentric training data fails to close the gap.

Learning to Deny: Action Denial in Multimodal Large Language Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

Remote sensing MLLMs perform poorly on negation tasks with hallucinations and accuracy drops, but the NeFo test-time learning method substantially improves negation understanding and generalizes to unseen tasks using ~5% unlabeled test samples.

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

cs.CV · 2026-06-09 · unverdicted · novelty 7.0

Earth-OneVision is a unified 2B-parameter RS-MLLM supporting six modalities and nine tasks via FGVLA, SLIS, and PCMA mechanisms plus a 34M QA-pair dataset, reporting competitive or superior benchmark results versus larger models.

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

TVI-CoT introduces learnable control tokens <THINK>, <LOOK>, <ANSWER> that let multimodal LLMs interleave textual reasoning with dynamic visual feature access, reporting gains of 3.4-6.1% on eight benchmarks over prior CoT baselines.

Closed-Form Spectral Regularization for Multi-Task Model Merging

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

SVHighlights is the first benchmark for highlight detection in hour-long sports videos, with TF-SELECTOR showing that segment-level LLM scoring outperforms adapted short-video baselines by 2.5-4 points on key metrics.

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

DisasterBench is a new multi-stage multimodal reasoning benchmark for UAV disaster response with 14 scenes and 9 tasks; the accompanying 2B DisasterVL model outperforms open-source MLLMs and approaches GPT-4o efficiency.

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general capabilities.

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

Attentive-CoT is an attention-guided fine-tuning objective that improves chain-of-thought performance in multimodal LLMs by delaying answer commitment and increasing sustained visual-token access during rationale generation.

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

cs.CV · 2026-05-29 · accept · novelty 7.0

ERGeoBench is a new diagnostic benchmark evaluating MLLMs on four capabilities in three progressive embodied geo-localization settings, finding that models handle high-level semantics but struggle with fine-grained perception and metric localization.

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

cs.CV · 2026-05-27 · unverdicted · novelty 7.0

Introduces SANSA paradigm for semantic-agnostic vision-language segmentation via dictionary or example-based prompts, with finetuning delivering up to 20% mIoU gains on the new task while retaining standard performance.

Touch-R1: Reinforcing Touch Reasoning in MLLMs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

Touch-R1 applies GRPO reinforcement learning on a new 1M tactile dataset and benchmark to train a Qwen2.5-VL-7B model that outperforms baselines on tactile perception and visual-tactile conflict tasks.

Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

OpenRef benchmark for open-world REC with F1 and N3R metrics and training-free MCC to improve existing models in complex scenarios.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0 · 2 refs

CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SDGBiasBench reveals intrinsic SDG biases in VLMs driven by priors rather than evidence, and CADE mitigates them with up to 25% accuracy gains and 12-point MAE reductions.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

citing papers explorer

Showing 50 of 256 citing papers.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding · cs.CV · 2026-02-24 · unverdicted · none · ref 6 · internal anchor
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs · cs.RO · 2026-02-09 · unverdicted · none · ref 28 · internal anchor
ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning · cs.CV · 2026-01-30 · unverdicted · none · ref 6 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects · cs.CV · 2026-01-21 · unverdicted · none · ref 7 · internal anchor
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation · cs.CV · 2025-12-18 · unverdicted · none · ref 72 · internal anchor
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines · cs.AI · 2025-08-08 · accept · none · ref 8 · internal anchor
GeoLaux is a new benchmark of 2186 long-step geometry problems requiring auxiliary lines, used to evaluate 23 MLLMs and reveal major drops in performance on complex tasks.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning · cs.CV · 2025-07-08 · conditional · none · ref 2 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops · cs.CL · 2025-06-17 · conditional · none · ref 8 · internal anchor
LingoLoop traps MLLMs into generating up to 367 times more tokens by applying POS-aware attention adjustments to postpone EOS tokens and pruning generative paths to sustain repetitive loops.
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models · cs.CV · 2025-06-10 · unverdicted · none · ref 10 · internal anchor
AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs · cs.CL · 2025-06-08 · unverdicted · none · ref 10 · internal anchor
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? · cs.CV · 2025-05-27 · conditional · none · ref 45 · internal anchor
Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence · cs.CV · 2025-05-22 · conditional · none · ref 15 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning · cs.CV · 2025-05-20 · unverdicted · none · ref 6 · internal anchor
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR · cs.CV · 2025-04-15 · conditional · none · ref 8 · internal anchor
Consensus Entropy measures inter-VLM output agreement to verify OCR reliability and enable self-improving ensembles, yielding 42.1% F1 gains over single-model judging.
SpaceR: Reinforcing MLLMs in Video Spatial Reasoning · cs.CV · 2025-04-02 · unverdicted · none · ref 3 · internal anchor
SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning · cs.CV · 2025-03-10 · unverdicted · none · ref 9 · internal anchor
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs · cs.CV · 2025-02-06 · unverdicted · none · ref 5 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning · cs.CV · 2024-12-31 · accept · none · ref 23 · internal anchor
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
Towards High-Resolution Visual Perception via Hierarchical Entity Exploration · cs.CV · 2026-07-01 · unverdicted · none · ref 6 · internal anchor
HEE is a training-free, model-agnostic method for high-resolution visual perception in MLLMs using hierarchical entity exploration with dual scoring, detection, clustering, and backtracking.
Wake up for Touch! Mask-isolated Tactile Alignment Learning in MLLMs · cs.CV · 2026-07-01 · unverdicted · none · ref 7 · internal anchor
Splash partitions MLLM parameters into dormant and critical subspaces via significance quantification, updating only the dormant subspace for tactile alignment while preserving general capabilities and achieving SOTA on visuo-tactile benchmarks.
PixelEyes: Decoupling Perception and Reasoning for Pinpoint Visual Evidence Seeking · cs.CV · 2026-06-30 · unverdicted · none · ref 6 · internal anchor
PixelEyes decouples reasoning and perception via mask-guided search and semantic BFS, introduces PixelEyes-6K dataset and Pinpoint-Bench benchmark, and open-sources code and models.
CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts · cs.CV · 2026-06-30 · unverdicted · none · ref 7 · internal anchor
CoLT replaces text-based chain-of-thought in MLLMs with 3-step latent thought chains supervised by a removable external decoder in forward and backward modes, yielding 10.1x faster inference on eight benchmarks.
InstanceControl: Controllable Complex Image Generation without Instance Labeling · cs.CV · 2026-06-30 · unverdicted · none · ref 4 · internal anchor
InstanceControl uses VLMs to auto-generate instance masks from text and visual conditions, with adaptive refinement, to enable controllable multi-object image generation without manual labeling.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning · cs.CL · 2026-06-29 · unverdicted · none · ref 9 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
Verifiable Geometry Problem Solving: Solver-Driven Autoformalization and Theorem Proposing · cs.AI · 2026-06-26 · unverdicted · none · ref 14 · internal anchor
SD-GPS uses solver-driven autoformalization via QwenVL3-2B with RL on executability and an impasse-aware verified theorem proposer to outperform prior methods on Geometry3K and PGPS9K.
TOPS: First-Principles Visual Token Pruning via Constructing Token Optimal Preservation Sets for Efficient MLLM Inference · cs.AI · 2026-06-25 · unverdicted · none · ref 12 · internal anchor
TOPS formulates visual token pruning as constructing Token Optimal Preservation Sets using three information-theoretic principles and demonstrates superior performance on MLLM benchmarks.
GAVEL: Grounded Caption Error Verification and Localization · cs.CL · 2026-06-25 · unverdicted · none · ref 32 · internal anchor
GAVEL introduces a joint task, dataset, and benchmark for verifying, explaining, and localizing caption-image misalignments, with a supervised baseline that improves grounding and explanation metrics over strong closed-source models.
RadarTwin: Scene-Specific mmWave Radar Simulation and Learning for Mobile Indoor Perception · cs.CV · 2026-06-24 · unverdicted · none · ref 8 · internal anchor
RadarTwin produces deployment-specific mmWave radar simulations from 3D models that enable real-object recognition at 2.5 times chance with zero real labels and 95.3% accuracy with few labels on a 12-way task.
ForensicsTok: Forensics-Guided Tokenized Modeling for Image Tampering Localization · cs.CV · 2026-06-23 · unverdicted · none · ref 26 · 2 links · internal anchor
ForensicsTok turns image manipulation localization into autoregressive token generation with a smoothing decoder and multi-scale forensic feature fusion, showing gains over MLLM baselines and slight gains over expert models on six benchmarks.
VisCritic: Visual State Comparison as Process Reward for GUI Agents · cs.CV · 2026-06-23 · unverdicted · none · ref 6 · internal anchor
VisCritic uses visual comparison of pre- and post-action GUI screenshots via a Siamese vision transformer and Action-Aware Critic Head to provide process rewards, improving agent performance on benchmarks.
Spectral Evolution-Guided Token Pruning in Multimodal Large Language Models · cs.CV · 2026-06-23 · unverdicted · none · ref 8 · internal anchor
CLSE prunes tokens in MLLMs by quantifying cross-layer spectral redistribution in the frequency domain to preserve semantically active tokens and reduce compute.
Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views · cs.CV · 2026-06-22 · unverdicted · none · ref 5 · internal anchor
DR-MV3D decomposes MV3D-VQA into global map construction, question-conditioned view planning, and egocentric grounding, supervised by global consistency and local trajectory rewards optimized via GRPO.
HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning · cs.CV · 2026-06-19 · unverdicted · none · ref 16 · internal anchor
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models · cs.CV · 2026-06-17 · unverdicted · none · ref 8 · internal anchor
PerceptionDLM enables parallel region captioning in multimodal diffusion language models via prompting and attention masking, introduces ParaDLC-Bench, and claims first parallel region perception with DLMs.
EventDrive: Event Cameras for Vision-Language Driving Intelligence · cs.CV · 2026-06-16 · unverdicted · none · ref 12 · internal anchor
EventDrive supplies a multi-task benchmark and EventDrive-VLM architecture that fuses event data, RGB, and language supervision, reporting gains in temporal precision and motion awareness for driving intelligence.
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers · cs.CV · 2026-06-11 · unverdicted · none · ref 150 · internal anchor
HYDRA-X presents the first unified multimodal model using a single ViT for holistic image-video tokenization, with ablations on attention and compression plus a latent-level editing improvement.
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning · cs.CV · 2026-06-10 · unverdicted · none · ref 40 · internal anchor
ReRe boosts open-source MLLMs on spatial reasoning benchmarks VSI-Bench and STI-Bench to rival proprietary SOTA by using a two-phase Reason then Re-reason process with Geometry-to-Video novel view synthesis.
The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models · cs.CV · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding · cs.CV · 2026-06-05 · unverdicted · none · ref 106 · internal anchor
LyraV uses FDTC and SToP for per-frame incremental decoding to reach 98.29% video synchrony at 3.89 FPS while preserving general understanding.
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models · cs.CV · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
MotionEnhancer distills motion priors from video diffusion models into VLMs via parameter-free attention alignment modules to improve motion-level video understanding.
MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models · cs.CV · 2026-06-04 · unverdicted · none · ref 2 · internal anchor
MedSIGHT unifies medical image comprehension and segmentation in Med-LVLMs via a Region Perceiver module and region codebook, trained progressively on 72K pairs to reach SOTA on both tasks across modalities.
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark · cs.CV · 2026-06-04 · unverdicted · none · ref 104 · internal anchor
WorldBench is a visually diverse multimodal reasoning benchmark where the strongest of 15 tested MLLMs reaches only 64% accuracy.
GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning · cs.CV · 2026-06-03 · unverdicted · none · ref 10 · internal anchor
GOPAgen proposes integrating video codec GOPs with a motion agent, GOP tree reasoning, structural memory, and motion vector database to improve efficiency and motion detail in agentic long-video VQA, reporting gains on MotionBench and EgoSchema.
Learning to Solve, Forgetting to Retain: Correct-Set Turnover in RLVR · cs.LG · 2026-06-02 · unverdicted · none · ref 85 · internal anchor
RLVR exhibits correct-set turnover where solved problems regress during training, and a periodic review mechanism exploiting a repair-window principle improves retention and performance over baselines.
Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models · cs.CV · 2026-06-01 · conditional · none · ref 24 · internal anchor
Leading VLMs show high cross-view consistency paired with low metric accuracy on distance queries, indicating evidence-insensitive reasoning rather than geometric grounding.
V-LynX: Token Interface Alignment for Video+X LLMs · cs.CV · 2026-05-30 · unverdicted · none · ref 2 · internal anchor
V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.
Vision-Language Models Suppress Female Representations Under Ambiguous Input · cs.CV · 2026-05-29 · unverdicted · none · ref 1 · internal anchor
VLMs encode female associations internally for ambiguous images of female-stereotyped occupations but output male due to asymmetric layer-wise suppression, revealed by the new LALS metric across 15 occupations and four models.
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration · cs.AI · 2026-05-29 · unverdicted · none · ref 4 · internal anchor
SCALE introduces three adversarial roles (Selector, Predictor, Judger) and a graph exploration method (SCALE-Hop) to enable MLLM-based web agents to self-discover limitations and improve, backed by the SCALE-20k dataset from 19 websites.
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding · cs.CV · 2026-05-28 · unverdicted · none · ref 36 · internal anchor
PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning · cs.CV · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
ROVER introduces a learnable routing plugin for object-centric visual evidence in MLLMs via token triplets and differential attention, reporting gains on MM-GCoT and VideoEspresso when integrated into Qwen2.5-VL-7B.
