Kimi-VL Technical Report

Bohong Yin; Bowei Xing; Bowen Qu; Bowen Wang; Cheng Chen; Chenlin Zhang; Chenzhuang Du; Chu Wei; Congcong Wang; Dehao Zhang

arXiv: 2504.07491 · v3 · submitted 2025-04-10 · cs.CV


Kimi Team: Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinhao Li, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yuhao Dong, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen, Zongyu Lin


Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3


keywords: vision-language model, mixture of experts, multimodal agent, long context, high-resolution vision, open-source model


The pith

Kimi-VL is an open-source MoE vision-language model activating only 2.8B parameters that matches flagship models on multi-turn agent tasks and long-context understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kimi-VL as a Mixture-of-Experts vision-language model built for efficiency and strong multimodal performance. It reports competitive results against larger systems on agent benchmarks like OSWorld, college-level image and video tasks, OCR, mathematical reasoning, and multi-image understanding. The model uses a 128K context window and a native-resolution vision encoder to handle long inputs and ultra-high-resolution images at lower cost. A long-thinking variant trained with chain-of-thought and reinforcement learning further extends reasoning on complex problems.

Core claim

Kimi-VL shows that a sparse MoE vision-language model with 2.8B active language-decoder parameters can reach or exceed the performance of much larger closed models on agentic tasks, long video comprehension, document understanding, and high-resolution perception while remaining computationally efficient.

What carries the argument

Mixture-of-Experts architecture in the language decoder paired with the native-resolution MoonViT vision encoder that processes high-resolution inputs directly.

If this is right

The model processes 128K-token contexts to score 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc.
It reaches 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro through direct high-resolution perception.
The Thinking variant scores 64.0 on MMMU and 80.1 on MathVista after long chain-of-thought training.
All weights and code are released publicly for further use and inspection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Efficient sparse models of this scale may lower the barrier to running advanced vision agents locally.
Open release of the weights could let researchers test whether the reported agent performance holds under varied prompting or new environments.
The long-thinking training recipe might generalize to other VLMs that currently struggle with multi-step visual reasoning.

Load-bearing premise

The reported benchmark scores on tasks like OSWorld and ScreenSpot-Pro reflect genuine general capabilities rather than results shaped by test contamination or undisclosed evaluation choices.

What would settle it

An independent run of the same Kimi-VL weights on the public OSWorld or ScreenSpot-Pro test sets that produces scores more than 10 points below those claimed in the report.
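The criterion above can be written as a simple check. This is a minimal sketch, not part of the paper or of Pith's tooling: the function name `shortfalls`, the dictionary layout, and the reproduced scores in the usage note are hypothetical. Only the ScreenSpot-Pro figure (34.5) and the 10-point margin come from the text; the OSWorld score is not quoted on this page, so it is left out.

```python
# Claimed score taken from the abstract quoted on this page.
CLAIMED = {"ScreenSpot-Pro": 34.5}

# Margin from "What would settle it": more than 10 points below the claim.
MARGIN = 10.0


def shortfalls(claimed: dict, reproduced: dict, margin: float = MARGIN) -> dict:
    """Map benchmark name -> (claimed - reproduced) for every benchmark
    whose reproduced score falls more than `margin` points below the claim.
    Benchmarks without a claimed score are ignored."""
    flagged = {}
    for name, score in reproduced.items():
        if name not in claimed:
            continue
        gap = claimed[name] - score
        if gap > margin:
            flagged[name] = gap
    return flagged
```

For example, a hypothetical independent run scoring 20.0 on ScreenSpot-Pro would be flagged, since the gap of 14.5 points exceeds the margin, while a run scoring 30.0 would not.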

Original abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Kimi-VL is a straightforward open-source MoE VLM release that claims competitive agent and long-context results at low active parameter count, but the evaluation details are thin.


The core takeaway is that Moonshot has released Kimi-VL, a 2.8B-active MoE vision-language model with a native-resolution MoonViT encoder, plus a Thinking variant trained on long CoT SFT and RL. It reports matching flagship models on multi-turn agent tasks like OSWorld and solid numbers on LongVideoBench, InfoVQA, and MMMU while staying efficient on standard inputs. The code and weights are public, which is the main practical value here.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Kimi-VL, an open-source MoE vision-language model activating 2.8B parameters in its language decoder, along with a long-thinking variant Kimi-VL-Thinking-2506. It claims strong performance on multimodal benchmarks including matching flagship models on multi-turn agent tasks such as OSWorld, scores of 64.5 on LongVideoBench, 83.2 on InfoVQA, 34.5 on ScreenSpot-Pro, and surpassing GPT-4o in several domains, while also reporting results on MMMU (64.0), MMMU-Pro (46.3), MathVision (56.9), and VideoMMMU (65.2) for the thinking variant. The work emphasizes efficiency via MoonViT native-resolution encoder, 128K context support, public code/models, and advances in long-context and agent capabilities.

Significance. If the reported benchmark results prove robust and reproducible, the work is significant as an efficient open-source VLM that competes with or exceeds closed models like GPT-4o and GPT-4o-mini on agent, long-video, and high-resolution tasks. The public release of code and models at the cited GitHub repository is a clear strength that enables independent verification and extension. The combination of MoE efficiency, native-resolution vision, and long-CoT RL training offers a practical contribution to accessible multimodal systems.

major comments (3)

[Benchmark results / Experiments] Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.
[Results and Discussion] Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.
[Model variants and training] Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).
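One way to supply the uncertainty estimates requested in the second major comment is a percentile bootstrap over per-item outcomes. The sketch below is illustrative only. It assumes the benchmark score is a plain accuracy over independently scored items; the function name, the resample count, and the 1,000-item example are assumptions, not the paper's evaluation setup.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    `outcomes` is a list of 0/1 per-item correctness flags.
    Returns (point_estimate, lower, upper) as fractions in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample items with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical example: 645 correct answers out of 1,000 items.
example = [1] * 645 + [0] * 355
accuracy, low, high = bootstrap_ci(example)
```

If two models' intervals on the same item set overlap heavily, a claimed lead of a point or two should be read with caution. A paired bootstrap over per-item differences would be a sharper test when per-item outputs are available for both models.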

minor comments (3)

[Model architecture] The abstract and main text introduce MoonViT without a dedicated subsection or diagram detailing its architecture, resolution handling, or parameter count relative to the MoE decoder; a short technical description would improve clarity.
[References] Several benchmark names and scores are listed without citing the original papers or providing links in the text or references section, which is standard for technical reports.
[Figures] Figure captions for any architecture or benchmark comparison plots could be expanded to include exact model versions and evaluation settings for immediate readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of reproducibility and transparency. We address each major comment below and will revise the manuscript to incorporate additional details where possible.


Referee: Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.

Authors: We agree that precise evaluation protocols are essential for reproducibility and fair comparisons, particularly for multi-turn agent tasks. In the revised manuscript, we will add a dedicated subsection in the Experiments section that explicitly describes the agent scaffolding, observation format, tool-use loop, maximum number of turns, and the exact prompting templates used for Kimi-VL as well as the baseline models on OSWorld and related tasks. This will directly support the reported performance claims. revision: yes
Referee: Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.

Authors: We acknowledge that the absence of variance estimates limits the ability to assess statistical reliability of the reported differences. In the revised version, we will clarify in the Results section that evaluations were performed with a single run per model (standard practice for many large-scale VLM benchmarks due to computational cost) and add a discussion of this limitation. Where multiple runs were feasible for smaller subsets, we will report them; otherwise, we will qualify the surpassing claims accordingly without overstating robustness. revision: partial
Referee: Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).

Authors: We agree that greater detail on the long CoT SFT and RL training would improve interpretability of the gains for Kimi-VL-Thinking-2506. In the revised manuscript, we will expand the relevant section to include additional information on the composition of the long-horizon reasoning data, the reward model design, and the decontamination procedures applied to benchmarks such as MMMU and MathVision. This will help readers better contextualize the performance numbers. revision: yes
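The rebuttal does not say how decontamination was done. A common generic approach is to flag training samples that share a long word n-gram with a benchmark item, and the sketch below illustrates only that idea. The function names, the whitespace tokenization, and the window size `n` are assumptions, not the authors' procedure.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text`.
    Texts shorter than n words yield an empty set."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(train_sample: str, benchmark_item: str, n: int = 8) -> bool:
    """True if the two texts share at least one word n-gram of length n."""
    return not word_ngrams(train_sample, n).isdisjoint(
        word_ngrams(benchmark_item, n)
    )
```

A check like this catches only near-verbatim text leakage. It says nothing about paraphrased questions or about benchmark images appearing in the training data, which matter for visual benchmarks such as MMMU and MathVision.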

Circularity Check

0 steps flagged

No circularity: empirical benchmark report with no derivations or self-referential predictions


The paper is a technical report describing the Kimi-VL model architecture (MoE VLM with MoonViT encoder), training process (SFT and RL for the Thinking variant), and performance on external benchmarks such as OSWorld, LongVideoBench, MMMU, InfoVQA, and ScreenSpot-Pro. No mathematical derivations, first-principles predictions, or fitted parameters are presented as novel results. All claims rest on reported benchmark scores compared to external models (GPT-4o, Qwen2.5-VL, etc.). There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to the paper's own inputs. The derivation chain is absent; the work is self-contained as an empirical evaluation report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

As an empirical technical report on a machine learning model, the work relies on standard assumptions in deep learning such as the effectiveness of transformer-based architectures and gradient-based optimization. No mathematical free parameters or ad-hoc axioms are introduced beyond the model design itself.

invented entities (1)

MoonViT (no independent evidence)
purpose: Native-resolution vision encoder to handle ultra-high-resolution inputs efficiently
Presented as the vision encoder component of Kimi-VL.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model... MoonViT... joint pre-training stages... Long-CoT SFT and RL

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models... 64.5 on LongVideoBench... 83.2 on InfoVQA

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
