arxiv: 2504.07491 · v3 · submitted 2025-04-10 · 💻 cs.CV

Kimi-VL Technical Report

Kimi Team: Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Haoning Wu, Haotian Yao, Hao Yang, Haoyu Lu, Hao Zhang, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jingyuan Liu, Jinhong Wang, Jin Xie, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Weiran He, Wei Song, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinhao Li, Xinxing Zu, Xinyuan Wang, Xinyu Zhou, Yang Li, Yangyang Hu, Yanru Chen, Yan Zhong, Y. Charles, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yuhao Dong, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen, Zongyu Lin

Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3

keywords: vision-language model, mixture of experts, multimodal agent, long context, high-resolution vision, open-source model

The pith

Kimi-VL is an open-source MoE vision-language model that activates only 2.8B language-decoder parameters and is reported to match flagship models on multi-turn agent tasks while handling 128K-token contexts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kimi-VL as a Mixture-of-Experts vision-language model built for efficiency and strong multimodal performance. It reports competitive results against larger systems on agent benchmarks like OSWorld, college-level image and video tasks, OCR, mathematical reasoning, and multi-image understanding. The model uses a 128K context window and a native-resolution vision encoder to handle long inputs and ultra-high-resolution images at lower cost. A long-thinking variant trained with chain-of-thought and reinforcement learning further extends reasoning on complex problems.

Core claim

Kimi-VL shows that a sparse MoE vision-language model with 2.8B active language-decoder parameters can reach or exceed the performance of much larger closed models on agentic tasks, long video comprehension, document understanding, and high-resolution perception while remaining computationally efficient.

What carries the argument

Mixture-of-Experts architecture in the language decoder paired with the native-resolution MoonViT vision encoder that processes high-resolution inputs directly.
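
As a minimal sketch of why a sparse MoE decoder touches only part of its parameters for each token, the toy router below keeps the top-k experts by gate score. The expert count, the value of k, and the parameter sizes are hypothetical numbers chosen for illustration; none of them come from the report, and this is not presented as Kimi-VL's actual routing code.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def active_fraction(num_experts, k, expert_params, shared_params):
    """Fraction of decoder parameters used by one token (hypothetical sizes)."""
    active = shared_params + k * expert_params
    total = shared_params + num_experts * expert_params
    return active / total

# Hypothetical configuration: 4 gate scores, 2 experts kept per token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
# Hypothetical sizes: 8 experts of 100 units each plus 200 shared units.
fraction = active_fraction(num_experts=8, k=2, expert_params=100, shared_params=200)
```

In this toy setting only experts 1 and 3 run for the token, which is the mechanism behind quoting an "active" parameter count that is smaller than the total.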

If this is right

With a 128K-token context window, the model scores 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc.
It reaches 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro through direct high-resolution perception.
The Thinking variant scores 64.0 on MMMU and 80.1 on MathVista after long chain-of-thought SFT and RL.
All weights and code are released publicly for further use and inspection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Efficient sparse models of this scale may lower the barrier to running advanced vision agents locally.
Open release of the weights could let researchers test whether the reported agent performance holds under varied prompting or new environments.
The long-thinking training recipe might generalize to other VLMs that currently struggle with multi-step visual reasoning.

Load-bearing premise

The reported benchmark scores on tasks like OSWorld and ScreenSpot-Pro reflect genuine general capabilities rather than results shaped by test contamination or undisclosed evaluation choices.
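
One common heuristic for probing test contamination is word-level n-gram overlap between a benchmark item and candidate training text. The sketch below is a generic illustration of that heuristic, not the decontamination procedure used by the authors (which this page does not describe); the function names, the default n, and the example strings are assumptions.

```python
def ngrams(text, n):
    """Set of word-level n-grams in lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(test_item, training_doc, n=8):
    """Share of the test item's n-grams that also occur in the training document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

# Made-up strings, for illustration only.
item = "what is the area of the shaded triangle"
leaked = "question: what is the area of the shaded triangle in the figure"
clean = "a cat sat on the mat today"

leaked_ratio = overlap_ratio(item, leaked, n=3)
clean_ratio = overlap_ratio(item, clean, n=3)
```

A ratio near 1 flags an item for manual inspection; a low ratio does not prove the item is clean, since paraphrased or image-only leakage is invisible to this check.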

What would settle it

An independent run of the same Kimi-VL weights on the public OSWorld or ScreenSpot-Pro test sets that produces scores more than 10 points below those claimed in the report.
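
The criterion above can be written as a one-line check. This is only a sketch: the function name is hypothetical, the claimed ScreenSpot-Pro score is taken from the abstract, and the reproduced scores are invented inputs, not real replication results.

```python
def settles_against(claimed, reproduced, margin=10.0):
    """True if the reproduced score is more than `margin` points below the claim."""
    return (claimed - reproduced) > margin

CLAIMED_SCREENSPOT_PRO = 34.5  # from the abstract

# Hypothetical replication outcomes, for illustration only.
large_gap = settles_against(CLAIMED_SCREENSPOT_PRO, reproduced=22.0)
small_gap = settles_against(CLAIMED_SCREENSPOT_PRO, reproduced=30.0)
```

A gap of exactly 10 points does not trigger the check, matching the "more than 10 points" wording.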

Original abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Kimi-VL is a straightforward open-source MoE VLM release that claims competitive agent and long-context results at low active parameter count, but the evaluation details are thin.

The core takeaway is that Moonshot has released Kimi-VL, a 2.8B-active MoE vision-language model with a native-resolution MoonViT encoder, plus a Thinking variant trained on long CoT SFT and RL. It reports matching flagship models on multi-turn agent tasks like OSWorld and solid numbers on LongVideoBench, InfoVQA, and MMMU while staying efficient on standard inputs. The code and weights are public, which is the main practical value here.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces Kimi-VL, an open-source MoE vision-language model activating 2.8B parameters in its language decoder, along with a long-thinking variant Kimi-VL-Thinking-2506. It claims strong performance on multimodal benchmarks including matching flagship models on multi-turn agent tasks such as OSWorld, scores of 64.5 on LongVideoBench, 83.2 on InfoVQA, 34.5 on ScreenSpot-Pro, and surpassing GPT-4o in several domains, while also reporting results on MMMU (64.0), MMMU-Pro (46.3), MathVision (56.9), and VideoMMMU (65.2) for the thinking variant. The work emphasizes efficiency via MoonViT native-resolution encoder, 128K context support, public code/models, and advances in long-context and agent capabilities.

Significance. If the reported benchmark results prove robust and reproducible, the work is significant as an efficient open-source VLM that competes with or exceeds closed models like GPT-4o and GPT-4o-mini on agent, long-video, and high-resolution tasks. The public release of code and models at the cited GitHub repository is a clear strength that enables independent verification and extension. The combination of MoE efficiency, native-resolution vision, and long-CoT RL training offers a practical contribution to accessible multimodal systems.

major comments (3)

[Benchmark results / Experiments] Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.
[Results and Discussion] Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.
[Model variants and training] Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).
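
To illustrate what the second comment asks for, the sketch below computes a Wilson score interval for a single-run benchmark score. It assumes the score is an accuracy over independent items, and the item count of 1000 is a hypothetical placeholder; the report's benchmark sizes are not given on this page.

```python
import math

def wilson_interval(score_percent, n_items, z=1.96):
    """Wilson score interval (in percent) for an accuracy measured on n_items."""
    p = score_percent / 100.0
    denom = 1.0 + z * z / n_items
    centre = (p + z * z / (2 * n_items)) / denom
    half = z * math.sqrt(p * (1 - p) / n_items + z * z / (4 * n_items ** 2)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)

# 64.5 is the reported LongVideoBench score; n_items=1000 is an assumption.
low, high = wilson_interval(64.5, n_items=1000)
```

Under these assumptions the 95% interval spans roughly three points on either side of the score, so gaps of a point or two between models would not be distinguishable from sampling noise without more runs or more items.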

minor comments (3)

[Model architecture] The abstract and main text introduce MoonViT without a dedicated subsection or diagram detailing its architecture, resolution handling, or parameter count relative to the MoE decoder; a short technical description would improve clarity.
[References] Several benchmark names and scores are listed without citing the original papers or providing links in the text or references section, which is standard for technical reports.
[Figures] Figure captions for any architecture or benchmark comparison plots could be expanded to include exact model versions and evaluation settings for immediate readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects of reproducibility and transparency. We address each major comment below and will revise the manuscript to incorporate additional details where possible.

Referee: Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.

Authors: We agree that precise evaluation protocols are essential for reproducibility and fair comparisons, particularly for multi-turn agent tasks. In the revised manuscript, we will add a dedicated subsection in the Experiments section that explicitly describes the agent scaffolding, observation format, tool-use loop, maximum number of turns, and the exact prompting templates used for Kimi-VL as well as the baseline models on OSWorld and related tasks. This will directly support the reported performance claims.
Revision: yes

Referee: Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.

Authors: We acknowledge that the absence of variance estimates limits the ability to assess statistical reliability of the reported differences. In the revised version, we will clarify in the Results section that evaluations were performed with a single run per model (standard practice for many large-scale VLM benchmarks due to computational cost) and add a discussion of this limitation. Where multiple runs were feasible for smaller subsets, we will report them; otherwise, we will qualify the surpassing claims accordingly without overstating robustness.
Revision: partial

Referee: Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).

Authors: We agree that greater detail on the long CoT SFT and RL training would improve interpretability of the gains for Kimi-VL-Thinking-2506. In the revised manuscript, we will expand the relevant section to include additional information on the composition of the long-horizon reasoning data, the reward model design, and the decontamination procedures applied to benchmarks such as MMMU and MathVision. This will help readers better contextualize the performance numbers.
Revision: yes
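
The protocol details requested in the first comment (scaffolding, observation format, tool-use loop, maximum turns, prompting) could be pinned down in a small record plus a bounded episode loop. The sketch below is a hypothetical illustration: every field value is a placeholder, the policy and environment are stubs, and none of it reflects the authors' actual OSWorld setup.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentEvalProtocol:
    benchmark: str
    observation_format: str
    action_space: str
    max_turns: int
    prompt_template_id: str
    runs_per_task: int = 1

def run_episode(policy, env_step, first_observation, protocol):
    """Generic tool-use loop: stop on success or after protocol.max_turns."""
    observation = first_observation
    for turn in range(1, protocol.max_turns + 1):
        action = policy(observation)
        observation, done = env_step(action)
        if done:
            return {"success": True, "turns": turn}
    return {"success": False, "turns": protocol.max_turns}

# All values below are placeholders, not settings from the report.
protocol = AgentEvalProtocol(
    benchmark="OSWorld",
    observation_format="screenshot",
    action_space="mouse+keyboard",
    max_turns=5,
    prompt_template_id="template-v0",
)

# Stub environment that reports completion on the third step.
steps = {"count": 0}

def stub_env_step(action):
    steps["count"] += 1
    return f"obs-{steps['count']}", steps["count"] >= 3

result = run_episode(lambda obs: "noop", stub_env_step, "obs-0", protocol)
```

Publishing such a record (for example via `asdict(protocol)`) for Kimi-VL and for each baseline would let readers see whether the compared systems were run under the same turn budget and observation format.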

Circularity Check

0 steps flagged

No circularity: empirical benchmark report with no derivations or self-referential predictions

The paper is a technical report describing the Kimi-VL model architecture (MoE VLM with MoonViT encoder), training process (SFT and RL for the Thinking variant), and performance on external benchmarks such as OSWorld, LongVideoBench, MMMU, InfoVQA, and ScreenSpot-Pro. No mathematical derivations, first-principles predictions, or fitted parameters are presented as novel results. All claims rest on reported benchmark scores compared to external models (GPT-4o, Qwen2.5-VL, etc.). There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to the paper's own inputs. The derivation chain is absent; the work is self-contained as an empirical evaluation report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

As an empirical technical report on a machine learning model, the work relies on standard assumptions in deep learning such as the effectiveness of transformer-based architectures and gradient-based optimization. No mathematical free parameters or ad-hoc axioms are introduced beyond the model design itself.

invented entities (1)

MoonViT (no independent evidence)
Purpose: native-resolution vision encoder to handle ultra-high-resolution inputs efficiently.
Presented as the vision encoder component of Kimi-VL.
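
A rough sketch of why a native-resolution encoder costs less on small images and more on very large ones: the number of visual tokens scales with the image area. The patch size of 14 and the 2x2 token merge below are assumptions made for illustration; this page does not state MoonViT's actual values.

```python
import math

def patch_grid(height, width, patch_size=14):
    """Rows and columns of patches when the image is not resized to a fixed shape."""
    rows = math.ceil(height / patch_size)
    cols = math.ceil(width / patch_size)
    return rows, cols

def visual_tokens(height, width, patch_size=14, merge=2):
    """Tokens passed to the decoder after merging merge x merge patch groups."""
    rows, cols = patch_grid(height, width, patch_size)
    return math.ceil(rows / merge) * math.ceil(cols / merge)

# Assumed patch_size=14 and merge=2; image sizes are illustrative.
small = visual_tokens(448, 448)
large = visual_tokens(1792, 1792)
```

Under these assumptions a 4x increase in side length gives a 16x increase in visual tokens, which is why a long context window and high-resolution perception go together.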

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model... MoonViT... joint pre-training stages... Long-CoT SFT and RL
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models... 64.5 on LongVideoBench... 83.2 on InfoVQA

Forward citations

Cited by 59 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Large Language Models Lack Temporal Awareness of Medical Knowledge
cs.LG 2026-05 unverdicted novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
cs.CV 2026-05 unverdicted novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
cs.CV 2026-04 accept novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
cs.LG 2026-05 conditional novelty 7.0

ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 7.0

ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
Count Anything at Any Granularity
cs.CV 2026-05 unverdicted novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
cs.LG 2026-05 unverdicted novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
quant-ph 2026-04 unverdicted novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
FCMBench-Video: Benchmarking Document Video Intelligence
cs.CV 2026-04 unverdicted novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
Can Multimodal Large Language Models Truly Understand Small Objects?
cs.CV 2026-04 unverdicted novelty 7.0

Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Discrete Prototypical Memories for Federated Time Series Foundation Models
cs.LG 2026-04 unverdicted novelty 7.0

FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.
Token Warping Helps MLLMs Look from Nearby Viewpoints
cs.CV 2026-04 unverdicted novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
cs.CL 2026-05 unverdicted novelty 6.0

ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
cs.AI 2026-05 unverdicted novelty 6.0

NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
cs.CV 2026-05 unverdicted novelty 6.0

SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
Dimension-Free Saddle-Point Escape in Muon
cs.LG 2026-05 unverdicted novelty 6.0

Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
cs.CV 2026-05 unverdicted novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
cs.AI 2026-05 unverdicted novelty 6.0

VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
cs.CV 2026-05 unverdicted novelty 6.0

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
cs.CV 2026-04 unverdicted novelty 6.0

SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
cs.CV 2026-04 unverdicted novelty 6.0

OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
cs.LG 2026-04 unverdicted novelty 6.0

NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
cs.LG 2026-04 unverdicted novelty 6.0

MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
cs.CV 2026-04 unverdicted novelty 6.0

AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Omnimodal Dataset Distillation via High-order Proxy Alignment
cs.CV 2026-04 unverdicted novelty 6.0

HoPA captures high-order cross-modal alignments via a shared proxy to enable scalable omnimodal dataset distillation with better performance-compression trade-offs.
AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocati...
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
cs.LG 2026-04 unverdicted novelty 6.0

Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newto...
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
cs.CV 2026-04 conditional novelty 6.0

Multimodal MoE models exhibit 'Seeing but Not Thinking' due to routing distraction where visual inputs fail to activate reasoning experts; a targeted intervention improves results by up to 3.17% across models and benchmarks.
Small Vision-Language Models are Smart Compressors for Long Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
Optimal Projection-Free Adaptive SGD for Matrix Optimization
math.OC 2026-04 unverdicted novelty 6.0

Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Perceptual Flow Network for Visually Grounded Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
cs.CV 2026-05 unverdicted novelty 5.0

PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
Let ViT Speak: Generative Language-Image Pre-training
cs.CV 2026-05 unverdicted novelty 5.0

GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Class-specific diffusion models improve military object detection in a low-data domain
cs.CV 2026-04 unverdicted novelty 5.0

Class-specific diffusion models fine-tuned on 8-24 real images per class generate synthetic data that improves military vehicle detection by up to 8% mAP50 in low-data regimes, with further gains from ControlNet edge ...
UniMesh: Unifying 3D Mesh Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
cs.AI 2026-04 unverdicted novelty 5.0

LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
Emerging Properties in Unified Multimodal Pretraining
cs.CV 2025-05 unverdicted novelty 5.0

BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
EasyVideoR1: Easier RL for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.