Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Chang Zhou; Dayiheng Liu; Jialin Wang; Jingren Zhou; Jinze Bai; Junyang Lin; Kai Dang; Keqin Chen; Mengfei Du; Peng Wang

arXiv: 2409.12191 · v2 · submitted 2024-09-18 · cs.CV · cs.AI · cs.CL

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

Pith reviewed 2026-05-23 20:48 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL

keywords Qwen2-VL · vision-language models · dynamic resolution · M-RoPE · multimodal benchmarks · scaling laws · large vision-language models

The pith

Qwen2-VL processes images at any resolution via dynamic token counts, and its 72B model is reported to perform comparably to GPT-4o.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Qwen2-VL series as an upgrade that replaces fixed-resolution image handling with a Naive Dynamic Resolution mechanism. This lets the model convert images of varying sizes into different numbers of visual tokens, while M-RoPE merges positional signals across text, images, and video in one framework. Scaling the model to 2B, 8B, and 72B parameters and training on more data yields a 72B version that the authors report as comparable to top closed models on multimodal benchmarks. A reader would care because the approach promises vision-language systems that adapt to real-world image sizes without forced resizing and that scale more effectively.

Core claim

Qwen2-VL redefines visual processing by introducing the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens, and by integrating Multimodal Rotary Position Embedding (M-RoPE) to facilitate effective fusion of positional information across text, images, and videos under a unified paradigm. Scaling both model size and training data yields the Qwen2-VL-72B model that achieves results comparable to GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks while outperforming other generalist models.

What carries the argument

Naive Dynamic Resolution mechanism that converts images of any size into variable numbers of visual tokens, paired with M-RoPE for cross-modal positional fusion.
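
As a rough illustration of how a variable token count could arise, the sketch below counts visual tokens for an image kept near its native size. The patch size of 14 pixels and the fusing of adjacent 2x2 patches into one token are assumptions made for illustration (the abstract gives no such details), and the function name `visual_token_count` is hypothetical.

```python
PATCH = 14   # assumed ViT patch edge in pixels (not stated in the abstract)
MERGE = 2    # assumed: adjacent MERGE x MERGE patches are fused into one token


def visual_token_count(height: int, width: int) -> int:
    """Count visual tokens for an image kept near its native resolution.

    Each side is rounded to the nearest multiple of PATCH * MERGE so the
    patch grid divides evenly; no fixed target resolution is imposed.
    """
    unit = PATCH * MERGE
    grid_h = max(1, round(height / unit))
    grid_w = max(1, round(width / unit))
    return grid_h * grid_w


if __name__ == "__main__":
    for h, w in [(224, 224), (448, 448), (336, 1344)]:
        print(f"{h}x{w} -> {visual_token_count(h, w)} tokens")
```

Under these assumptions the token budget grows with pixel area, so a wide 336x1344 image keeps its aspect ratio and simply receives more tokens than a 224x224 one.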

If this is right

Visual representations become more efficient because the token count tracks each image's resolution instead of a preset grid.
Positional information from text, images, and video fuses in one embedding space.
Images and videos share the same processing pipeline without separate architectures.
Larger models and more data continue to improve results on multimodal tasks.
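
The positional-fusion point in the list above can be pictured with a small sketch. It assumes one plausible layout for multimodal rotary position IDs: every token carries a (temporal, height, width) triple, text tokens repeat one index across all three components, and image tokens hold the temporal index fixed while the height and width indices follow the patch grid. The abstract does not spell out this layout, so the function `position_ids` and its segment format are hypothetical.

```python
from typing import List, Tuple, Union

# A segment is either ("text", n_tokens) or ("image", grid_h, grid_w).
Segment = Union[Tuple[str, int], Tuple[str, int, int]]
PosId = Tuple[int, int, int]  # (temporal, height, width)


def position_ids(segments: List[Segment]) -> List[PosId]:
    """Assign 3-component position IDs to a mixed text/image sequence."""
    ids: List[PosId] = []
    start = 0  # next free index; each segment begins here
    for seg in segments:
        if seg[0] == "text":
            n = seg[1]
            # Text: all three components are equal, like ordinary 1D positions.
            ids.extend((start + i, start + i, start + i) for i in range(n))
            start += n
        elif seg[0] == "image":
            _, gh, gw = seg
            # Image: temporal index is constant; height/width follow the grid.
            ids.extend((start, start + r, start + c)
                       for r in range(gh) for c in range(gw))
            # Assumed convention: the next segment resumes after the largest
            # index used so far.
            start += max(gh, gw)
        else:
            raise ValueError(f"unknown segment kind: {seg[0]!r}")
    return ids


if __name__ == "__main__":
    for pid in position_ids([("text", 2), ("image", 2, 3), ("text", 1)]):
        print(pid)
```

Because text reduces to three identical components, a scheme like this behaves as ordinary one-dimensional positions on text-only input, which is one way a single embedding could serve all modalities.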

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adopting variable token counts could reduce preprocessing steps such as forced resizing in deployed applications.
The scaling observations may inform how much additional data is needed when moving from 8B to 72B parameters.
Similar dynamic mechanisms might extend to other modalities like audio where input length varies widely.

Load-bearing premise

The performance gains come mainly from the new dynamic resolution and position embedding methods rather than from differences in training data quality or evaluation protocols.

What would settle it

A controlled comparison in which a model trained on identical data and compute but using fixed-resolution processing and standard position embeddings reaches the same benchmark scores as Qwen2-VL-72B.
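
One way to lay out such a controlled comparison is a small factorial grid in which data and compute are held fixed and only the two mechanisms vary. The sketch below is hypothetical throughout: the `AblationRun` fields, the option names, and the budget values are placeholders for illustration, not settings from the paper.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class AblationRun:
    resolution: str      # "fixed" or "dynamic"
    position: str        # "rope_1d" or "m_rope"
    data_id: str         # identical across runs by construction
    compute_budget: str  # identical across runs by construction


def ablation_grid(data_id: str, compute_budget: str) -> List[AblationRun]:
    """Enumerate the 2 x 2 grid that isolates each mechanism's contribution."""
    return [
        AblationRun(res, pos, data_id, compute_budget)
        for res, pos in product(("fixed", "dynamic"), ("rope_1d", "m_rope"))
    ]


if __name__ == "__main__":
    for run in ablation_grid(data_id="shared-corpus", compute_budget="same"):
        print(run)
```

The ("fixed", "rope_1d") cell is the baseline described above; the two mixed cells would additionally show which mechanism, if either, accounts for any gap.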

Original abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract describes new dynamic resolution and M-RoPE for Qwen2-VL but supplies no scores or ablations to support matching GPT-4o performance.

Qwen2-VL adds two concrete changes to the prior Qwen-VL line: a Naive Dynamic Resolution scheme that lets the model take images at their native size and produce variable numbers of tokens, plus M-RoPE that extends rotary embeddings to handle text, image, and video positions together. They also run scaling experiments across 2B, 8B, and 72B sizes and release code on GitHub. Those mechanisms address real pain points in current vision-language models, where fixed resolution often forces cropping or resizing that loses detail. The unified image and video pipeline is a reasonable engineering choice. Making the code public is the right move for an open model effort.

The problem is that the abstract states the 72B model reaches parity with GPT-4o and Claude 3.5 Sonnet but shows none of the numbers, tables, or ablations that would let us check that. We have no way to see whether the gains trace to the new components or to differences in training data volume and quality. Without those controls the main result stays untestable.

This work is aimed at researchers and engineers who follow open-source multimodal models and want to know what the Qwen team is shipping next. It is not yet ready for a serious referee because the evidence for the central claims is missing. Wait for the full version with the benchmark tables and component ablations before investing review time.

Referee Report

2 major / 1 minor

Summary. The paper presents the Qwen2-VL series as an upgrade to prior Qwen-VL models. It introduces a Naive Dynamic Resolution mechanism that processes images of arbitrary resolutions into variable numbers of visual tokens, Multimodal Rotary Position Embedding (M-RoPE) to fuse positional information across text, images, and videos, and a unified image/video processing paradigm. The work explores scaling laws for large vision-language models by training variants at 2B, 8B, and 72B parameters with increased data, claiming that the 72B model reaches performance levels comparable to GPT-4o and Claude 3.5 Sonnet on multimodal benchmarks while outperforming other generalist models.

Significance. If the performance claims and the effectiveness of the proposed mechanisms are substantiated with detailed experiments, the work would offer a meaningful step toward more flexible and human-aligned visual perception in LVLMs by removing fixed-resolution constraints. The scaling investigation could also contribute empirical guidance on model and data scaling for multimodal systems.

major comments (2)

[Abstract] The central claim that Qwen2-VL-72B achieves results comparable to GPT-4o and Claude3.5-Sonnet is presented without any benchmark scores, tables, error bars, or quantitative comparisons, preventing assessment of whether the gains derive from the named mechanisms or from unreported differences in training data and protocols.
[Abstract] No description, pseudocode, or ablation is supplied for the Naive Dynamic Resolution mechanism or M-RoPE, so it is impossible to verify their load-bearing role in the reported performance or to reproduce the efficiency claims.

minor comments (1)

[Abstract] The phrase 'closely aligning with human perceptual processes' is used without supporting evidence or citation.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the comments on the abstract. The points raised are valid regarding the level of detail provided in the summary. We address each below and indicate planned revisions where feasible based on the available manuscript text.

Referee: [Abstract] The central claim that Qwen2-VL-72B achieves results comparable to GPT-4o and Claude3.5-Sonnet is presented without any benchmark scores, tables, error bars, or quantitative comparisons, preventing assessment of whether the gains derive from the named mechanisms or from unreported differences in training data and protocols.

Authors: We agree that the abstract would be strengthened by including specific quantitative results. The full paper contains benchmark tables with direct comparisons, but to address this concern we will revise the abstract to incorporate key performance metrics (e.g., scores on representative multimodal benchmarks) that support the comparability statement. revision: yes
Referee: [Abstract] No description, pseudocode, or ablation is supplied for the Naive Dynamic Resolution mechanism or M-RoPE, so it is impossible to verify their load-bearing role in the reported performance or to reproduce the efficiency claims.

Authors: Abstracts are concise summaries and do not normally contain pseudocode or ablations. The provided manuscript consists only of the abstract, which mentions the mechanisms at a high level but supplies no further technical detail. We can add a brief high-level sentence describing the mechanisms to the abstract, but full descriptions, pseudocode, and ablations cannot be supplied from the available text. revision: partial

standing simulated objections not resolved

Detailed descriptions, pseudocode, and ablation studies for Naive Dynamic Resolution and M-RoPE, which are absent from the provided abstract and cannot be reproduced without the full manuscript body.

Circularity Check

0 steps flagged

No circularity: empirical performance claims with no derivations or load-bearing self-references

The abstract (only text available) contains no equations, no derivation chain, and no mathematical steps that could reduce to inputs by construction. It describes mechanisms (Naive Dynamic Resolution, M-RoPE) and scaling, then states empirical benchmark outcomes for Qwen2-VL-72B. No self-citations appear, let alone any that are load-bearing. This matches the default expectation of a non-circular empirical report; the central claim does not reduce to a fit or self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond standard assumptions of large model training.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
cs.CV 2026-05 accept novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
cs.CV 2026-05 unverdicted novelty 8.0

VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...
MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays
cs.CV 2026-05 conditional novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts
cs.CV 2026-05 unverdicted novelty 8.0

An MLLM-guided architecture with a mixture of frequency experts and relational alignment loss achieves state-of-the-art all-in-one image restoration, outperforming prior methods by up to 1.35 dB on the CDD11 dataset.
Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views
cs.CV 2026-05 unverdicted novelty 8.0

GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
CADFS: A Big CAD Program Dataset and Framework for Computer-Aided Design with Large Language Models
cs.CV 2026-05 unverdicted novelty 8.0

CADFS supplies a large real-world CAD dataset and FeatureScript representation that, after VLM fine-tuning, produces more accurate and feature-rich designs than prior generative CAD systems.
SpikeMLLM: Spike-based Multimodal Large Language Models via Modality-Specific Temporal Scales and Temporal Compression
cs.NE 2026-04 unverdicted novelty 8.0

SpikeMLLM is the first spike-based MLLM framework that maintains near-lossless performance under aggressive timestep compression and delivers 9x throughput and 25x power efficiency gains via a custom RTL accelerator.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
cs.CV 2026-04 conditional novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
cs.CV 2026-03 accept novelty 8.0

VAREX benchmark shows structured output compliance limits models under 4B parameters more than extraction ability, with layout-preserving text giving the largest accuracy gains over images.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV 2026-01 unverdicted novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV 2025-11 unverdicted novelty 8.0

MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
A document is worth a structured record: Principled inductive bias design for document recognition
cs.CV 2025-07 unverdicted novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
cs.RO 2026-05 unverdicted novelty 7.0

AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization
cs.AI 2026-05 unverdicted novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
cs.CV 2026-05 conditional novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation
cs.CV 2026-05 conditional novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that...
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
cs.CV 2026-05 conditional novelty 7.0

Injecting pre-computed layout priors from RT-DETR into VLM prompts raises markdown F1 from 0.37 to 0.92 on a 10k-page OOD benchmark and cuts infinite-loop failures across domains.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
Modality-Decoupled Online Recursive Editing
cs.LG 2026-05 conditional novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
cs.CV 2026-05 accept novelty 7.0

EgoTraj is a new open multimodal dataset of 75 long-horizon egocentric human navigation sequences in urban environments with head pose, gaze, and scene data, plus benchmarks of trajectory prediction methods.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
cs.AI 2026-05 unverdicted novelty 7.0

EEG study of 27 participants reveals distinct neural patterns for AI-generated hallucinations, with misjudged ones failing to trigger standard fact verification pathways.
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
cs.CR 2026-05 unverdicted novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-05 unverdicted novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
cs.CV 2026-05 unverdicted novelty 7.0

TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% ...
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings
cs.IR 2026-05 unverdicted novelty 7.0

Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 7.0

G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
cs.CV 2026-05 unverdicted novelty 7.0

VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
cs.CV 2026-05 unverdicted novelty 7.0

DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
cs.CV 2026-05 unverdicted novelty 7.0

Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision
cs.CV 2026-05 unverdicted novelty 7.0

A hierarchical VQA system aggregates model answers into weighted risk scores that produce four-category safety event maps for urban navigation, backed by a new 20-city dataset where generative MLLMs like Qwen-VL outpe...
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
cs.AI 2026-05 unverdicted novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
cs.CV 2026-05 unverdicted novelty 7.0

CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
Follow the Mean: Reference-Guided Flow Matching
cs.LG 2026-05 unverdicted novelty 7.0

Flow matching admits reference-guided control by shifting the conditional endpoint mean, enabling training-free steering of models like FLUX via example banks and a semi-parametric variant on DiT.
Follow the Mean: Reference-Guided Flow Matching
cs.LG 2026-05 unverdicted novelty 7.0

Flow matching admits controllable generation by shifting the conditional endpoint mean computed from a reference set, enabling training-free guidance on frozen pretrained models.
SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation
cs.CV 2026-05 unverdicted novelty 7.0

SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.
StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
cs.CV 2026-05 unverdicted novelty 7.0

StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents
cs.CL 2026-05 unverdicted novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV 2026-05 conditional novelty 7.0

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
cs.CV 2026-05 unverdicted novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 conditional novelty 7.0

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks w...
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
cs.CV 2026-05 unverdicted novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...