InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
14 Pith papers cite this work.
2026: 14 representative citing papers
- UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
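MAP-Agent itself isn't spelled out in this summary; below is a minimal sketch of the evidence-centered active-inspection idea, where `vlm_score` is a hypothetical relevance scorer: rather than downsampling the whole UHR scene, the agent scores cheap tile thumbnails and re-inspects only the strongest evidence at native resolution.

```python
from PIL import Image

def active_inspect(image: Image.Image, vlm_score, tile=1024, top_k=4):
    """Evidence-centered inspection sketch: score a coarse thumbnail of
    each tile, then return only the top-k tiles at native resolution
    for fine-grained (micro-target) querying."""
    w, h = image.size
    candidates = []
    for x in range(0, w, tile):
        for y in range(0, h, tile):
            box = (x, y, min(x + tile, w), min(y + tile, h))
            thumb = image.crop(box).resize((224, 224))  # cheap coarse pass
            candidates.append((vlm_score(thumb), box))  # evidence score
    candidates.sort(reverse=True, key=lambda c: c[0])
    # Full-resolution crops for the strongest evidence only.
    return [image.crop(box) for _, box in candidates[:top_k]]
```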
- CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GRPO training with temporal/spatial IoU rewards.
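The paper's GRPO reward shaping isn't reproduced here; a minimal sketch of a temporal-IoU term of the kind described, assuming anomaly spans are (start, end) intervals in seconds (a spatial-IoU term would be analogous over boxes):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between a predicted and a ground-truth anomaly time span."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: reward a rollout that localizes an anomaly at 3.0-7.5 s
# against a ground-truth span of 4.0-8.0 s.
reward = temporal_iou((3.0, 7.5), (4.0, 8.0))  # 0.7
```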
- ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
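A minimal sketch of the training-free, plug-and-play idea, with `depth_model` and `detector` as stand-ins for the expert models: the agent verbalizes their outputs as explicit spatial cues in the MLLM prompt.

```python
def spatial_cues_prompt(frame, question, depth_model, detector):
    """Training-free cue injection: run expert models on a video frame
    and verbalize their outputs as explicit 3D context for the MLLM."""
    depth = depth_model(frame)                    # HxW depth map
    cues = []
    for obj in detector(frame):                   # [{'label', 'box'}, ...]
        x0, y0, x1, y1 = obj["box"]
        d = float(depth[(y0 + y1) // 2, (x0 + x1) // 2])
        cues.append(f"{obj['label']} at ~{d:.1f} m, bbox {obj['box']}")
    return "Spatial cues:\n" + "\n".join(cues) + f"\n\nQuestion: {question}"
```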
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization
CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.
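A minimal sketch of one plausible reading of the latent objective, pairing an angle-bounded perturbation with an InfoNCE loss; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def angle_perturb(z: torch.Tensor, max_deg: float = 15.0) -> torch.Tensor:
    """Rotate each latent toward a random direction by at most max_deg,
    keeping its norm: a bounded exploratory perturbation."""
    noise = F.normalize(torch.randn_like(z), dim=-1)
    theta = torch.rand(z.shape[:-1], device=z.device).unsqueeze(-1) \
            * max_deg * torch.pi / 180.0
    zn = F.normalize(z, dim=-1)
    # Remove the component of noise parallel to z, then rotate in-plane.
    ortho = F.normalize(noise - (noise * zn).sum(-1, keepdim=True) * zn, dim=-1)
    return (zn * torch.cos(theta) + ortho * torch.sin(theta)) \
           * z.norm(dim=-1, keepdim=True)

def contrastive_loss(z: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE: each latent should match its perturbed view, not others."""
    a, b = F.normalize(z, dim=-1), F.normalize(angle_perturb(z), dim=-1)
    logits = a @ b.t() / tau
    return F.cross_entropy(logits, torch.arange(len(z), device=z.device))
```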
- Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
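A minimal sketch of a time-preserved projector of the kind argued for, versus the pooling projectors the paper blames; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class TimePreservedProjector(nn.Module):
    """Apply the same MLP to every frame token independently and add a
    temporal position embedding, so frame order survives projection."""
    def __init__(self, vis_dim=1024, llm_dim=4096, max_frames=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
        self.time_emb = nn.Embedding(max_frames, llm_dim)

    def forward(self, x):                      # x: (B, T, N, vis_dim)
        B, T, N, _ = x.shape
        h = self.mlp(x)                        # per-token, order untouched
        t = torch.arange(T, device=x.device)
        return h + self.time_emb(t)[None, :, None, :]

# A mean-pool-over-time projector, by contrast, maps (B, T, N, D) to
# (B, N, D) and erases the arrow of time before the LLM ever sees it.
```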
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
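A minimal sketch of how such pairs can be bootstrapped from a single 2D image, assuming off-the-shelf `detector` and `depth_model` experts (the paper's pipeline is surely broader):

```python
import random

def depth_order_qa(frame, detector, depth_model):
    """Synthesize a 'which is closer' QA pair from one 2D image using
    pseudo-labels from expert models (no 3D ground truth needed)."""
    objs = detector(frame)                      # [{'label', 'box'}, ...]
    depth = depth_model(frame)                  # HxW depth map
    a, b = random.sample(objs, 2)               # assumes >= 2 detections

    def center_depth(o):
        x0, y0, x1, y1 = o["box"]
        return float(depth[(y0 + y1) // 2, (x0 + x1) // 2])

    closer = a if center_depth(a) < center_depth(b) else b
    return (f"Which is closer to the camera: the {a['label']} or the "
            f"{b['label']}?", f"The {closer['label']}.")
```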
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and trajectory accuracy on the NAVSIM v1 benchmark.
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
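A minimal sketch of the slice-then-compress flow; the norm-based token selection at `compress_at` is a cheap stand-in for whatever merge operator the paper actually uses:

```python
import torch

def encode_sliced(image_tokens, vit_blocks, compress_at=2, keep_ratio=0.5):
    """image_tokens: list of (N, D) tensors, one per image slice.
    Run a few ViT blocks at full length, then keep only the most
    salient tokens (by L2 norm here, as a stand-in) so every later
    block runs on a shorter sequence: that is where the FLOPs go."""
    outs = []
    for x in image_tokens:
        for i, blk in enumerate(vit_blocks):
            if i == compress_at:               # early, intra-ViT compression
                k = max(1, int(x.shape[0] * keep_ratio))
                idx = x.norm(dim=-1).topk(k).indices.sort().values
                x = x[idx]
            x = blk(x)
        outs.append(x)
    return torch.cat(outs, dim=0)              # visual tokens for the LLM
```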
- SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
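A minimal sketch of the shared-latent-query adapter, assuming an HF-style frozen model that accepts `inputs_embeds`; only the queries and a projection head would be trained:

```python
import torch
import torch.nn as nn

class SharedLatentQueries(nn.Module):
    """Learned queries shared across modalities; the frozen MLLM is the
    only encoder, and only the queries and head carry gradients."""
    def __init__(self, dim=4096, num_queries=8, out_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.head = nn.Linear(dim, out_dim)

    def forward(self, frozen_mllm, input_embeds):   # (B, L, dim)
        B = input_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        seq = torch.cat([input_embeds, q], dim=1)   # queries appended
        # Assumes an HF-style forward that returns last_hidden_state.
        hidden = frozen_mllm(inputs_embeds=seq).last_hidden_state
        emb = hidden[:, -q.shape[1]:].mean(dim=1)   # pool query states
        return nn.functional.normalize(self.head(emb), dim=-1)
```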
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
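A minimal sketch of one plausible reading of the similarity-plus-DP refinement step: dynamic programming picks chunk boundaries so that highly similar neighboring tokens stay in the same chunk. The paper's actual cost function is likely richer:

```python
import numpy as np

def chunk_by_similarity(emb: np.ndarray, num_chunks: int):
    """emb: (n, d) token embeddings. Choose chunk boundaries maximizing
    the summed cosine similarity of adjacent tokens that remain inside
    the same chunk. O(num_chunks * n^2) DP."""
    n = len(emb)
    u = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = (u[:-1] * u[1:]).sum(axis=1)            # sim[t] = cos(t, t+1)
    pre = np.concatenate([[0.0], np.cumsum(sim)]) # prefix sums of sim
    score = lambda i, j: pre[j - 1] - pre[i]      # chunk [i, j), j > i
    best = np.full((num_chunks + 1, n + 1), -1e18)
    prev = np.zeros((num_chunks + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_chunks + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):             # last chunk is [i, j)
                s = best[k - 1, i] + score(i, j)
                if s > best[k, j]:
                    best[k, j], prev[k, j] = s, i
    cuts, j = [], n                               # backtrack boundaries
    for k in range(num_chunks, 0, -1):
        cuts.append(j)
        j = prev[k, j]
    return sorted(cuts)                           # chunk end indices
```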
- ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
ERASE prunes 85% of vision tokens in Qwen2.5-VL-7B while retaining 89.46% accuracy, outperforming prior pruning methods, which retain only 78.1% accuracy.
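The summary doesn't give ERASE's actual criteria; below is a generic two-stage sketch with attention saliency for stage one and text relevance for stage two as stand-ins:

```python
import torch

def two_stage_prune(vis_tokens, cls_attn, text_emb, ratio1=0.5, ratio2=0.3):
    """vis_tokens: (N, D); cls_attn: (N,) attention from a summary token;
    text_emb: (D,) pooled query embedding. Stage 1 drops tokens the
    vision encoder itself ignores; stage 2 drops survivors irrelevant
    to the text query. Overall keep rate = ratio1 * ratio2 (0.15 here,
    i.e. ~85% of tokens pruned)."""
    k1 = max(1, int(len(vis_tokens) * ratio1))
    idx1 = cls_attn.topk(k1).indices
    kept = vis_tokens[idx1]                        # stage 1: visual saliency
    rel = torch.nn.functional.cosine_similarity(kept, text_emb[None], dim=-1)
    k2 = max(1, int(k1 * ratio2))
    idx2 = rel.topk(k2).indices
    return kept[idx2]                              # stage 2: text relevance
```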