hub Mixed citations

Kimi-VL Technical Report

Kimi Team: Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen · 2025 · cs.CV · arXiv 2504.07491

Mixed citation behavior. Most common role is background (64%).

90 Pith papers citing it

Background: 64% of classified citations

abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

citation-role summary

background 24 · baseline 8 · method 3 · other 1

citation-polarity summary

background 23 · baseline 8 · use method 3 · unclear 2

representative citing papers

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

cs.CV · 2026-04-10 · accept · novelty 8.0

HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.

VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

SpaceDG introduces the first large-scale degradation-aware spatial reasoning dataset using 3D Gaussian Splatting synthesis, showing that visual degradations impair MLLM performance but finetuning on the data improves robustness and can exceed human levels under degradation.

CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PAGER achieves 4.1x higher task success in point-precise geometric GUI control by combining topology-aware planning with precision-aligned reinforcement learning on the new PAGE Bench dataset of 4,906 problems.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

cs.AI · 2026-05-09 · unverdicted · novelty 7.0 · 3 refs

VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks QFSD and AgriInsect.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

quant-ph · 2026-04-28 · unverdicted · novelty 7.0

Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

FCMBench-Video: Benchmarking Document Video Intelligence

cs.CV · 2026-04-28 · unverdicted · novelty 7.0

FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.

Can Multimodal Large Language Models Truly Understand Small Objects?

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

Discrete Prototypical Memories for Federated Time Series Foundation Models

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

cs.CV · 2026-02-04 · conditional · novelty 7.0

VISTA-Bench shows vision-language models degrade on visualized text in images compared to equivalent pure text, with larger gaps under increased perceptual difficulty.

Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

cs.CV · 2026-01-20 · conditional · novelty 7.0

Weather-R1 is a multimodal reasoning model for meteorology that uses logical consistency rewards during reinforcement fine-tuning to cut self-contradictory outputs and raises benchmark accuracy by 9.8 points over baselines.

citing papers explorer

Showing 40 of 90 citing papers.

Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs cs.LG · 2026-04-20 · unverdicted · none · ref 42 · internal anchor
NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better energy efficiency.
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers cs.CV · 2026-04-17 · unverdicted · none · ref 26 · internal anchor
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch cs.CV · 2026-04-15 · unverdicted · none · ref 41 · internal anchor
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs cs.CV · 2026-04-13 · unverdicted · none · ref 80 · internal anchor
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
Omnimodal Dataset Distillation via High-order Proxy Alignment cs.CV · 2026-04-12 · unverdicted · none · ref 3 · internal anchor
HoPA captures high-order cross-modal alignments via a shared proxy to enable scalable omnimodal dataset distillation with better performance-compression trade-offs.
AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models cs.CL · 2026-04-11 · unverdicted · none · ref 31 · internal anchor
AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocation and related tasks, supported by the DecaTARA benchmark of 67,941 videos.
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts cs.CV · 2026-04-09 · conditional · none · ref 1 · internal anchor
Multimodal MoE models exhibit 'Seeing but Not Thinking' due to routing distraction where visual inputs fail to activate reasoning experts; a targeted intervention improves results by up to 3.17% across models and benchmarks.
Small Vision-Language Models are Smart Compressors for Long Video Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 12 · internal anchor
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 47 · internal anchor
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding cs.CV · 2026-04-06 · unverdicted · none · ref 23 · internal anchor
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
Optimal Projection-Free Adaptive SGD for Matrix Optimization math.OC · 2026-04-02 · unverdicted · none · ref 6 · internal anchor
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs? cs.LG · 2026-02-20 · conditional · none · ref 71 · 2 links · internal anchor
MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.
Continually Evolving Skill Knowledge in Vision Language Action Model cs.RO · 2025-11-22 · unverdicted · none · ref 36 · internal anchor
Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results with only 1% data replay and successful real-world transfer on dual-arm hardware.
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models cs.CV · 2025-11-18 · conditional · none · ref 44 · internal anchor
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 126 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
Grounded Reinforcement Learning for Visual Reasoning cs.CV · 2025-05-29 · unverdicted · none · ref 61 · internal anchor
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 49 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 65 · internal anchor
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models cs.CV · 2023-06-23 · unverdicted · none · ref 42 · internal anchor
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Perceptual Flow Network for Visually Grounded Reasoning cs.CV · 2026-05-04 · unverdicted · none · ref 47 · internal anchor
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 60 · internal anchor
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Class-specific diffusion models improve military object detection in a low-data domain cs.CV · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
Class-specific diffusion models fine-tuned on 8-24 real images per class generate synthetic data that improves military vehicle detection by up to 8% mAP50 in low-data regimes, with further gains from ControlNet edge conditioning.
UniMesh: Unifying 3D Mesh Understanding and Generation cs.CV · 2026-04-19 · unverdicted · none · ref 47 · internal anchor
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration cs.AI · 2026-04-15 · unverdicted · none · ref 5 · internal anchor
LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi-agent orchestration.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence cs.CL · 2026-04-08 · unverdicted · none · ref 41 · internal anchor
OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
Kimi K2.5: Visual Agentic Intelligence cs.CL · 2026-02-02 · unverdicted · none · ref 57 · internal anchor
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning cs.CV · 2025-08-15 · unverdicted · none · ref 20 · internal anchor
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 71 · internal anchor
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 33 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training cs.LG · 2026-02-25 · unverdicted · none · ref 33 · internal anchor
Muon+ adds one normalization step after polar orthogonalization in the Muon optimizer, yielding lower training and validation perplexity and faster pre-training across 60M-7B models.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 131 · internal anchor
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unreviewed · ref 38 · internal anchor
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents cs.CV · 2026-05-19 · unreviewed · ref 14 · internal anchor
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination cs.CV · 2026-05-15 · unreviewed · ref 13 · internal anchor
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective cs.LG · 2026-05-13 · unreviewed · ref 5 · 2 links · internal anchor
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction cs.CL · 2026-05-11 · unreviewed · ref 9 · 2 links · internal anchor
Let ViT Speak: Generative Language-Image Pre-training cs.CV · 2026-05-01 · unreviewed · ref 62 · internal anchor
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference cs.LG · 2026-04-19 · unreviewed · ref 4 · 2 links · internal anchor
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning cs.LG · 2026-04-11 · unreviewed · ref 1 · internal anchor
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs cs.CV · 2025-11-18 · unreviewed · ref 49 · internal anchor
