DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
Kimi-VL Technical Report
Mixed citation behavior. Most common role is background (64%).
abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
representative citing papers
MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs; evaluations of 23 agents show low success rates, especially on real systems like OpenEMR.
LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
A large examination-level ultrasound dataset with long-form reports enables simple LVLM fine-tuning to outperform prior complex methods.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
PRCR enables replay-free visual revisiting in interleaved multimodal reasoning by storing raw visual KV caches with spatial coordinates and rebinding keys to position-compatible coordinates, matching replay performance while cutting computation by orders of magnitude.
REALM is the first unified red-teaming benchmark for physical-world VLMs; it aligns diverse attack methods via an agentic target-generation pipeline and evaluates them on shared datasets, finding text/typographic attacks the most effective.
Processed egocentric human video outperforms teleoperated real-robot trajectories as pretraining data for embodied foundation models, delivering 24% lower validation loss and 52.5-90% higher task success rates under matched post-training protocols.
MODE decomposes expert selection frequency by modality, filters redundant vision tokens, adds per-modality sensitivity, and uses ILP to assign bit-widths, limiting average loss to 2.9% at W3A16 on MoE-MLLMs.
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
TVI-CoT introduces learnable control tokens <THINK>, <LOOK>, <ANSWER> that let multimodal LLMs interleave textual reasoning with dynamic visual feature access, reporting gains of 3.4-6.1% on eight benchmarks over prior CoT baselines.
A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.
UltraVR is a new diagnostic benchmark for evidence-grounded VQA on ultra-resolution images, with structured chain-of-thought annotations that localize failures in grounding, perception, and inference.
The Moment-Video benchmark shows that the top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
HakushoBench provides 2,053 Japanese chart and table images from governmental white papers with QA pairs, showing open-weight VLMs reach only 58.6% accuracy versus higher proprietary performance.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
EAGLE is a new evidence-aligned framework that improves multi-agent VQA by enforcing consistency in visual grounding across agents, achieving best average performance on six benchmarks.
DriveSpatial benchmark shows the strongest of 15 VLMs trails humans by 28.4 points on spatiotemporal tasks, with cognitive scene construction as the primary weakness.
VideoOdyssey is a new benchmark featuring ultra-long videos (avg. 109 min) across 11 domains with multi-level continuous certificates (avg. 16 min for visual, 12.8 min for audio-visual) to diagnose MLLM limitations in continuous reasoning and omni-modal perception.
SpaceDG is the first large-scale benchmark dataset (~1M QA pairs) simulating nine visual degradations in 3DGS-rendered scenes to measure and improve spatial intelligence robustness in MLLMs.