Qwen3-VL Technical Report

An Yang; Binyuan Hui; Bowen Yu; Bo Zheng; Chang Gao; Chenglong Liu; Chenxu Lv; Chunjiang Ge; Dayiheng Liu; Dunjie Lu

arXiv: 2511.21631 · v2 · submitted 2025-11-26 · cs.CV · cs.AI

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu

Pith reviewed 2026-05-17 04:28 UTC · model grok-4.3

keywords: Qwen3-VL, vision-language model, long-context multimodal, multimodal reasoning, interleaved input, video understanding, MoE architecture, MMMU benchmark

The pith

Qwen3-VL adds native 256K-token support for interleaved text, images and video while lifting pure-text and multimodal reasoning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen3-VL as the latest vision-language model in the Qwen series, released in both dense and mixture-of-experts sizes. It claims three main advances: noticeably better performance on text-only tasks, reliable handling of 256K-token contexts that mix text with many images or long videos, and stronger results on reasoning benchmarks that combine vision and mathematics. The paper presents three concrete architectural changes alongside these gains. The model family is positioned for use in agentic workflows that require grounding decisions in extended visual and textual records.

Core claim

Qwen3-VL delivers three core pillars: markedly stronger pure-text understanding, robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, and advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks.

What carries the argument

Three upgrades: enhanced interleaved-MRoPE for spatial-temporal modeling, DeepStack for integrating multi-level ViT features into vision-language alignment, and text-based time alignment that replaces the earlier T-RoPE scheme with explicit textual timestamps for video.
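
The third upgrade, explicit textual timestamps, can be pictured with a minimal sketch. It assumes a hypothetical prompt layout in which each sampled frame is preceded by a plain-text timestamp. The `<t=... s>` format, the `<frame>` placeholder, and both function names are illustrative assumptions, not the format the paper uses.

```python
def sample_times(duration_s, fps):
    """Uniformly sample frame times (in seconds) at the given rate."""
    n = int(duration_s * fps)
    return [i / fps for i in range(n)]


def interleave_timestamps(frame_times_s, frame_token="<frame>"):
    """Build a prompt skeleton where each frame follows its textual timestamp.

    Both the timestamp format and the frame placeholder are hypothetical.
    """
    parts = []
    for t in frame_times_s:
        # The timestamp is ordinary text, so the language model reads time
        # as tokens rather than inferring it from a positional encoding.
        parts.append(f"<t={t:.1f} s>")
        parts.append(frame_token)
    return " ".join(parts)


prompt = interleave_timestamps(sample_times(duration_s=2.0, fps=2.0))
print(prompt)
# <t=0.0 s> <frame> <t=0.5 s> <frame> <t=1.0 s> <frame> <t=1.5 s> <frame>
```

In a layout like this, a question such as "what happens at 1.0 s?" can be matched against literal text in the context, which is one plausible reading of why textual timestamps might help temporal grounding.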

If this is right

The 256K window enables direct retention and cross-referencing inside long documents that contain many images or inside extended video sequences.
Pure-text capability improves even when the model receives multimodal training.
Both dense and MoE variants achieve the gains under comparable token budgets and latency constraints.
The resulting models can serve as backbones for image-grounded reasoning and multimodal code generation.
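
To give a feel for what a 256K window means for interleaved input, here is a rough budget check. It is a sketch under stated assumptions: "256K" is read as 256 × 1024 = 262,144 tokens, and the per-image token cost and the text reserve are hypothetical numbers chosen for illustration, not figures from the paper.

```python
CONTEXT_TOKENS = 256 * 1024  # assumption: "256K" read as 262,144 tokens


def max_images(context_tokens, tokens_per_image, reserved_text_tokens):
    """Return how many images fit after reserving room for text."""
    if tokens_per_image <= 0:
        raise ValueError("tokens_per_image must be positive")
    available = context_tokens - reserved_text_tokens
    return max(available, 0) // tokens_per_image


# Hypothetical numbers: 1,024 tokens per image, 32,768 tokens kept for text.
n_images = max_images(
    CONTEXT_TOKENS,
    tokens_per_image=1024,
    reserved_text_tokens=32 * 1024,
)
print(n_images)  # 224
```

The real per-image cost depends on resolution and the vision encoder, so the count above only shows the shape of the trade-off between visual and textual content in one window.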

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the long-context claims hold, the models could process complete technical reports or feature-length films for summarization and question answering without chunking.
The text-only gains suggest that careful multimodal pre-training can strengthen rather than trade off against language modeling.
Explicit timestamp alignment may generalize to other temporal media such as audio transcripts paired with video.

Load-bearing premise

The reported benchmark improvements arise primarily from the three listed architectural changes rather than from larger training data, extra compute, or selective evaluation.

What would settle it

Train a comparable baseline model on the same data volume and token budget but omit the three upgrades, then re-run MMMU and MathVista to check whether the performance gap disappears.
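
The proposed experiment can be laid out as a small configuration grid. This is a sketch of the bookkeeping only: the flag names are hypothetical labels for the three upgrades, and nothing here reflects how the authors actually configure training.

```python
from itertools import product

# Hypothetical flag names for the three upgrades described in the paper.
UPGRADES = ("interleaved_mrope", "deepstack", "text_timestamps")


def ablation_grid(upgrades=UPGRADES):
    """Enumerate every on/off combination of the upgrades."""
    return [
        dict(zip(upgrades, flags))
        for flags in product((False, True), repeat=len(upgrades))
    ]


def leave_one_out(upgrades=UPGRADES):
    """Return the full model plus one run per disabled upgrade."""
    full = {u: True for u in upgrades}
    runs = {"full": full}
    for u in upgrades:
        runs[f"no_{u}"] = {**full, u: False}
    return runs


grid = ablation_grid()
runs = leave_one_out()
print(len(grid))     # 8 configurations in the full factorial design
print(sorted(runs))  # the full model and three leave-one-out runs
```

The first grid entry (all flags off) is the baseline described above; the leave-one-out runs are a cheaper alternative when the full factorial design is too expensive, at the cost of not measuring interactions between upgrades.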

Original abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Qwen3-VL adds three practical tweaks for long interleaved multimodal contexts, but the report gives no ablations or numbers to show that those tweaks are responsible for the claimed gains.

The main takeaway is that this is a standard technical report rolling out the next Qwen vision-language model with some targeted engineering changes for better handling of mixed text, images, and video over long sequences. The three upgrades are an enhanced interleaved-MRoPE for spatial-temporal positioning, DeepStack to pull multi-level ViT features into the language model, and text-based timestamp alignment for video instead of the prior T-RoPE method. These look like sensible incremental moves to tighten alignment and temporal grounding without reinventing the architecture from scratch. The model family spans dense sizes from 2B to 32B plus MoE variants, all with a native 256K token window that works for interleaved multimodal input. That scale of context is useful for applications that need to cross-reference across long videos or documents with visuals embedded.

Referee Report

3 major / 2 minor

Summary. The paper introduces Qwen3-VL, the latest vision-language model in the Qwen series, with dense (2B/4B/8B/32B) and MoE (30B-A3B/235B-A22B) variants. It claims three core strengths: stronger pure-text understanding than comparable text-only models, robust long-context comprehension with a native 256K-token window for interleaved text/image/video inputs, and advanced multimodal reasoning on single-image, multi-image, and video tasks, with leading results on benchmarks such as MMMU, MathVista, and MathVision. The work highlights three architectural upgrades—enhanced interleaved-MRoPE for spatial-temporal modeling, DeepStack for multi-level ViT feature integration, and text-based time alignment for video—and states that these yield superior performance under comparable token budgets and latency constraints.

Significance. If the performance claims are substantiated with controlled evaluations, the work would represent a useful incremental advance in open multimodal models by extending long-context capabilities to interleaved inputs and improving temporal grounding. The provision of both dense and MoE variants across a range of sizes supports practical deployment considerations. However, the absence of isolating experiments limits the ability to credit the listed upgrades specifically.

major comments (3)

[Abstract] The central claims of 'superior performance' and 'leading performance' on MMMU, MathVista, and MathVision are asserted without any quantitative scores, baseline comparisons, error bars, or evaluation protocol details. This leaves the primary empirical contribution unsupported by visible evidence.
[Architecture and Experiments] Architecture and evaluation sections: The manuscript attributes the reported gains in pure-text understanding, long-context retention, and multimodal reasoning to the three upgrades (enhanced interleaved-MRoPE, DeepStack, and text-based time alignment). No controlled ablations are described that train otherwise identical models with each upgrade disabled while holding token budget, data mixture, and optimization schedule fixed. Without such comparisons, it is not possible to isolate the contribution of the architectural changes from differences in overall compute or data.
[Long-context evaluation] Long-context claims: The native 256K-token window for interleaved multimodal inputs is presented as a core pillar, yet no details are provided on the maximum tested context length, retrieval accuracy metrics, or cross-referencing performance on long documents or videos.

minor comments (2)

[Abstract] The abstract refers to 'visual-math benchmarks (e.g., MathVista and MathVision)' without clarifying whether these are held-out or overlap with training data mixtures.
[Model variants] Notation for the MoE variants (e.g., 30B-A3B) should be defined explicitly on first use to avoid ambiguity with total vs. active parameters.
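
The ambiguity raised in the last comment can be made concrete. Under the usual reading of such names (an assumption here, since the abstract does not define the notation), `30B-A3B` denotes roughly 30B total parameters of which roughly 3B are active per token, while a dense name such as `8B` has all parameters active. The parser below is a hypothetical helper that encodes that reading.

```python
import re

# Matches dense names ("8B") and MoE names ("30B-A3B").
_VARIANT = re.compile(r"^(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?$")


def parse_variant(name):
    """Return (total, active) parameter counts in billions.

    Assumes "<total>B-A<active>B" for MoE names; for dense names the
    active count equals the total.
    """
    m = _VARIANT.match(name)
    if m is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    total = float(m.group(1))
    active = float(m.group(2)) if m.group(2) else total
    return total, active


for name in ["2B", "4B", "8B", "32B", "30B-A3B", "235B-A22B"]:
    total, active = parse_variant(name)
    print(f"{name}: total={total:g}B, active={active:g}B")
```

Under this reading, latency comparisons between the dense and MoE variants would track the active count, while memory footprint would track the total, which is why the distinction matters for the "comparable latency" claim.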

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the Qwen3-VL technical report. We address each major point below and have revised the manuscript to improve clarity and support for the claims where feasible.

Referee: [Abstract] The central claims of 'superior performance' and 'leading performance' on MMMU, MathVista, and MathVision are asserted without any quantitative scores, baseline comparisons, error bars, or evaluation protocol details. This leaves the primary empirical contribution unsupported by visible evidence.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version, we have incorporated key benchmark scores (e.g., MMMU, MathVista, MathVision) with brief baseline references and evaluation notes. Full tables, comparisons, and protocol details remain in the Experiments section.
Revision: yes

Referee: [Architecture and Experiments] Architecture and evaluation sections: The manuscript attributes the reported gains in pure-text understanding, long-context retention, and multimodal reasoning to the three upgrades (enhanced interleaved-MRoPE, DeepStack, and text-based time alignment). No controlled ablations are described that train otherwise identical models with each upgrade disabled while holding token budget, data mixture, and optimization schedule fixed. Without such comparisons, it is not possible to isolate the contribution of the architectural changes from differences in overall compute or data.

Authors: We acknowledge the absence of fully isolated ablations under fixed training conditions. Reproducing such experiments at the reported scales would require prohibitive additional compute. The upgrades are presented as incremental extensions from Qwen2-VL; we have added a discussion section clarifying their design motivations and observed cumulative effects through comparisons to prior variants, while noting the limitations of attributing gains solely to individual components.
Revision: partial

Referee: [Long-context evaluation] Long-context claims: The native 256K-token window for interleaved multimodal inputs is presented as a core pillar, yet no details are provided on the maximum tested context length, retrieval accuracy metrics, or cross-referencing performance on long documents or videos.

Authors: We appreciate this observation. The revised manuscript expands the long-context evaluation subsection to report the maximum tested lengths (up to 256K tokens for interleaved inputs), retrieval accuracy results (including multimodal needle-in-a-haystack variants), and quantitative cross-referencing performance on long documents and videos.
Revision: yes
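
For readers unfamiliar with the needle-in-a-haystack protocol mentioned in the last response, the following is a toy, text-only sketch of the idea. All names are hypothetical, and `toy_model` is a deliberately limited stand-in that only reads the first half of its context. A multimodal variant would use images or video frames as the haystack items instead of sentences.

```python
PREFIX = "The secret code is "


def build_haystack(filler, n_items, needle, depth):
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    items = [filler] * n_items
    items.insert(round(depth * n_items), needle)
    return items


def retrieval_accuracy(answer_fn, secret, n_items, depths):
    """Fraction of insertion depths at which `answer_fn` recovers the secret."""
    needle = f"{PREFIX}{secret}."
    hits = 0
    for d in depths:
        context = build_haystack("Filler sentence.", n_items, needle, d)
        hits += int(answer_fn(context) == secret)
    return hits / len(depths)


def toy_model(context):
    """Stand-in for a real model: scans only the first half of the context."""
    for item in context[: len(context) // 2]:
        if item.startswith(PREFIX):
            return item[len(PREFIX):].rstrip(".")
    return None


acc = retrieval_accuracy(
    toy_model, "7421", n_items=100, depths=[0.0, 0.25, 0.5, 0.75, 1.0]
)
print(acc)  # 0.4: the toy model misses needles placed at depth 0.5 or later
```

Reporting accuracy per depth and per context length, rather than a single average, is what exposes failures like the one built into `toy_model`.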

Circularity Check

0 steps flagged

No circularity: empirical benchmark results do not reduce to inputs by construction.

The paper reports three architectural upgrades (enhanced interleaved-MRoPE, DeepStack ViT integration, text-based time alignment) and states superior results on external benchmarks such as MMMU, MathVista, and MathVision under comparable token budgets. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims are presented as direct empirical outcomes rather than derivations that collapse to the listed changes by construction. The manuscript is self-contained against standard external benchmarks with no evident reduction of the central claims to tautological inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical benchmark results and the three described architectural changes; the abstract lists no explicit free parameters, background axioms, or newly postulated entities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Files: IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Cost/FunctionalEquation.lean
Theorems: reality_from_one_distinction, washburn_uniqueness_aczel
Relation: unclear
Paper passage: "We introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling... (ii) DeepStack integration... (iii) text-based time alignment for video..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
ViMU: Benchmarking Video Metaphorical Understanding
cs.CV 2026-05 unverdicted novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation
cs.AI 2026-05 unverdicted novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings
cs.CV 2026-05 conditional novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments
cs.AI 2026-04 accept novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.
Lost in Translation: Do LVLM Judges Generalize Across Languages?
cs.CL 2026-04 unverdicted novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV 2026-04 unverdicted novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
cs.CV 2026-04 unverdicted novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
cs.CV 2026-04 unverdicted novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
cs.CV 2026-04 unverdicted novelty 8.0

PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning
cs.CV 2026-04 conditional novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision
cs.CV 2026-02 conditional novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
cs.CR 2026-01 unverdicted novelty 8.0

GUIGuard-Bench is a new benchmark with annotated GUI screenshots that measures privacy recognition, planning fidelity under protection, and utility impact for trajectory-based GUI agents.
Common to Whom? Regional Cultural Commonsense and LLM Bias in India
cs.CL 2026-01 unverdicted novelty 8.0

Cultural commonsense in India is mostly regional, with only 39.4% agreement across five regions, and LLMs achieve just 13.4-20.9% accuracy while over-representing North and Central areas.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV 2026-01 unverdicted novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
cs.CV 2026-01 unverdicted novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
cs.CV 2025-12 accept novelty 8.0

ToG-Bench is the first benchmark for task-oriented spatio-temporal video grounding in egocentric videos, with explicit-implicit dual grounding and one-to-many object scenarios across 100 ScanNet clips and 2704 instructions.
ETCHR: Editing To Clarify and Harness Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

A decoupled question-conditioned image editor trained via supervised imitation then VLM-reward enhancement improves MLLM visual reasoning Pass@1 by 4.6-5.5 points across models and tasks.
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
cs.CV 2026-05 unverdicted novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
cs.CV 2026-05 unverdicted novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
cs.CV 2026-05 unverdicted novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in...
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
cs.CV 2026-05 conditional novelty 7.0

Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
cs.CV 2026-05 unverdicted novelty 7.0

FashionLens is a task-adaptive MLLM framework that achieves SOTA performance on diverse fashion image retrieval scenarios via spherical query calibration and gradient-guided sampling.
Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
cs.CV 2026-05 unverdicted novelty 7.0

FundusGround is a new benchmark with 10,719 fundus images, 15,595 ETDRS-grid localized lesions, and 72,706 VQA questions to support clinically interpretable ophthalmic visual question answering.
Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
cs.AI 2026-05 unverdicted novelty 7.0

Introduces Synergistic Faithfulness metric based on Shapley Interaction Index to evaluate cross-modal synergy in VLM explainers, revealing over-reliance on visual salience in existing methods.
Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the Grounded Personality Reasoning task and MM-OCEAN dataset to show that MLLMs frequently produce correct Big Five personality ratings without grounding them in observable video evidence.
MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
cs.AI 2026-05 unverdicted novelty 7.0

MPDocBench-Parse provides a 3,246-page benchmark and evaluation protocol for multi-page document parsing that tests text/table/formula extraction, merging, figure handling, reading order, and heading hierarchy.
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
cs.CV 2026-05 conditional novelty 7.0

JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
cs.CV 2026-05 accept novelty 7.0

AgroVG is a new multi-source benchmark for agricultural visual grounding formulated as generalized set prediction, with protocols for box and mask grounding across single-target, multi-target, and target-absent querie...
Visual-Advantage On-Policy Distillation for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
cs.CV 2026-05 unverdicted novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
cs.CV 2026-05 unverdicted novelty 7.0

GenEvolve proposes a self-evolving agent framework for open-ended image generation that uses tool-orchestrated trajectories and visual experience distillation from best-worst differences to achieve reported state-of-t...
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

ArchSIBench is a new benchmark dataset and evaluation suite that measures vision-language models on architectural spatial intelligence across 17 subtasks, showing most models lag human baselines especially in transfor...
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors
cs.CV 2026-05 unverdicted novelty 7.0

LangTail uses entity-level semantic priors from language models aligned via contrastive learning in a hierarchical clustering setup to resolve long-tail ambiguity, yielding +13.5, +12.9, and +8.9 mIoU gains on ScanNet...
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
cs.CV 2026-05 conditional novelty 7.0

MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation, spanning video, audio, shot, and reference dimensions with an adaptive evaluation framework that reaches 91.5% Spearman correlation...
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
cs.CV 2026-05 unverdicted novelty 7.0

SetCon achieves state-of-the-art open-ended referring segmentation by using LVLM-generated set-level concepts for joint mask decoding, with gains increasing for multi-target cases on image and video benchmarks.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

A new dual-protocol expert benchmark for image aesthetics is fused into ground truth and used to self-distill a VLM, raising SRCC from 0.504 to 0.709 across categories while matching closed-source performance.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
cs.CV 2026-05 conditional novelty 7.0

PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer...
EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
cs.LG 2026-05 conditional novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
cs.CV 2026-05 unverdicted novelty 7.0

LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
Modality-Decoupled Online Recursive Editing
cs.LG 2026-05 conditional novelty 7.0

M-ORE decouples text and visual update statistics in MLLMs and applies recursive low-rank edits in an orthogonal subspace to reduce cross-modal conflict and long-horizon interference.
Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference
cs.CV 2026-05 unverdicted novelty 7.0

RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or chan...
EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos
cs.CV 2026-05 unverdicted novelty 7.0

EgoExoMem is the first benchmark for cross-view memory reasoning on synchronized egocentric-exocentric videos, where E2-Select raises MLLM accuracy from 55.3% to 58.2% over baselines.
Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models
cs.CV 2026-05 unverdicted novelty 7.0

Incantation is the first video world model to use per-frame natural language conditioning for simultaneous multi-entity control and concept-level cross-entity transfer in interactive video generation.
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

OmniPro is the first benchmark jointly evaluating omni-modal perception, proactive responding, and diverse streaming video understanding tasks using a dual-mode protocol on 2700 samples.
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
cs.CV 2026-05 conditional novelty 7.0

SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot coop...
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
cs.CV 2026-05 unverdicted novelty 7.0

IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.
Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment
cs.CV 2026-05 unverdicted novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
TriAxialKV: Toward Extreme Low-Precision KV-Cache Quantization for Agentic Inference Tasks
cs.LG 2026-05 unverdicted novelty 7.0

TriAxialKV introduces triaxial mixed-precision KV-cache quantization that matches BF16 accuracy at 4.5x cache size and 30% higher throughput for a Qwen3-VL agent on OSWorld.
HEED: Density-Weighted Residual Alignment for Hybrid Vision-Language Model Distillation
cs.CV 2026-05 unverdicted novelty 7.0

HEED replaces uniform residual alignment with density-weighted alignment using patch self-dissimilarity to improve hybrid VLM distillation, gaining 8.7 points on OCRBench v2 and 5.13 on a 10-benchmark average.