Kimi-VL Technical Report
Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3
The pith
Kimi-VL is an open-source MoE vision-language model that activates only 2.8B language-decoder parameters, is reported to match flagship models on multi-turn agent tasks, and supports long-context understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi-VL shows that a sparse MoE vision-language model with 2.8B active language-decoder parameters can reach or exceed the performance of much larger closed models on agentic tasks, long video comprehension, document understanding, and high-resolution perception while remaining computationally efficient.
What carries the argument
Mixture-of-Experts architecture in the language decoder paired with the native-resolution MoonViT vision encoder that processes high-resolution inputs directly.
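To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is not Kimi-VL's implementation: the expert count, `top_k`, the router logits, and the toy scalar "experts" are all illustrative assumptions, since this page does not give the model's actual configuration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts.

    Weights are the router probabilities renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, top_k=2):
    """Weighted sum over the selected experts only; the rest are never evaluated."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, top_k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, 0.3, 2.0]  # made-up router scores for one token

selected = route_top_k(router_logits, top_k=2)
output = moe_forward(1.0, experts, router_logits, top_k=2)
print(selected, output)
```

Only two of the four toy experts run for this token, which is the sense in which an MoE decoder can hold many parameters while activating a small fraction per token.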
If this is right
- With a 128K-token context window, the model scores 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc.
- It reaches 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro through direct high-resolution perception.
- The Thinking variant (Kimi-VL-Thinking-2506) scores 64.0 on MMMU and 80.1 on MathVista after long chain-of-thought SFT and RL.
- All weights and code are released publicly for further use and inspection.
Where Pith is reading between the lines
- Efficient sparse models of this scale may lower the barrier to running advanced vision agents locally.
- Open release of the weights could let researchers test whether the reported agent performance holds under varied prompting or new environments.
- The long-thinking training recipe might generalize to other VLMs that currently struggle with multi-step visual reasoning.
Load-bearing premise
The reported benchmark scores on tasks like OSWorld and ScreenSpot-Pro reflect genuine general capabilities rather than results shaped by test contamination or undisclosed evaluation choices.
What would settle it
An independent run of the same Kimi-VL weights on the public OSWorld or ScreenSpot-Pro test sets that produces scores more than 10 points below those claimed in the report.
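The settling test reduces to a simple comparison, sketched below. `falls_short` is a hypothetical helper introduced for illustration; the 34.5 ScreenSpot-Pro figure is the one the report states, while the "independent" scores are made-up placeholders, not real replication results.

```python
def falls_short(reported, independent, margin=10.0):
    """True when the independent score is more than `margin` points below the reported one."""
    return (reported - independent) > margin

REPORTED_SCREENSPOT_PRO = 34.5  # score stated in the report

# Placeholder independent scores, for illustration only.
print(falls_short(REPORTED_SCREENSPOT_PRO, 30.0))  # gap of 4.5 points
print(falls_short(REPORTED_SCREENSPOT_PRO, 20.0))  # gap of 14.5 points
```

A gap of exactly 10 points does not count as a failure here, matching the "more than 10 points" wording.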
Original abstract
We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Kimi-VL, an open-source MoE vision-language model activating 2.8B parameters in its language decoder, along with a long-thinking variant, Kimi-VL-Thinking-2506. It claims strong performance on multimodal benchmarks: matching flagship models on multi-turn agent tasks such as OSWorld, scoring 64.5 on LongVideoBench, 83.2 on InfoVQA, and 34.5 on ScreenSpot-Pro, and surpassing GPT-4o in several domains. For the thinking variant it reports MMMU (64.0), MMMU-Pro (46.3), MathVision (56.9), and VideoMMMU (65.2). The work emphasizes efficiency via the native-resolution MoonViT encoder, 128K context support, public code and models, and advances in long-context and agent capabilities.
Significance. If the reported benchmark results prove robust and reproducible, the work is significant as an efficient open-source VLM that competes with or exceeds closed models like GPT-4o and GPT-4o-mini on agent, long-video, and high-resolution tasks. The public release of code and models at the cited GitHub repository is a clear strength that enables independent verification and extension. The combination of MoE efficiency, native-resolution vision, and long-CoT RL training offers a practical contribution to accessible multimodal systems.
major comments (3)
- [Benchmark results / Experiments] Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.
- [Results and Discussion] Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.
- [Model variants and training] Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).
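The second major comment asks for variance estimates. One common way to attach an interval to a single-run benchmark score is a percentile bootstrap over per-item outcomes, sketched below. This is an illustration of the general technique, not a procedure from the manuscript; `bootstrap_ci` is a hypothetical helper, and the 1000-item test set is a placeholder size chosen so that the synthetic accuracy equals 64.5.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy (in percent).

    `outcomes` is a list of per-item results, 1 for correct and 0 for incorrect.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute accuracy.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(100.0 * sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic per-item results: 645 correct out of 1000 (placeholder size).
outcomes = [1] * 645 + [0] * 355
point = 100.0 * sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy {point:.1f}, 95% interval [{lo:.1f}, {hi:.1f}]")
```

An interval of this kind only captures sampling noise over test items; it says nothing about run-to-run variation from decoding randomness or prompt choice, which would need repeated evaluations.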
minor comments (3)
- [Model architecture] The abstract and main text introduce MoonViT without a dedicated subsection or diagram detailing its architecture, resolution handling, or parameter count relative to the MoE decoder; a short technical description would improve clarity.
- [References] Several benchmark names and scores are listed without citing the original papers or providing links in the text or references section, which is standard for technical reports.
- [Figures] Figure captions for any architecture or benchmark comparison plots could be expanded to include exact model versions and evaluation settings for immediate readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important aspects of reproducibility and transparency. We address each major comment below and will revise the manuscript to incorporate additional details where possible.
Point-by-point responses
- Referee: Benchmark results section (e.g., tables reporting OSWorld, LongVideoBench, InfoVQA, ScreenSpot-Pro): The manuscript provides no description of the precise evaluation protocol for multi-turn agent tasks, including the agent scaffolding, observation format, tool-use loop, maximum turns, or exact prompting used for Kimi-VL versus baselines. This detail is load-bearing for the central claim of matching flagship models on OSWorld and for fair comparison to closed models.
  Authors: We agree that precise evaluation protocols are essential for reproducibility and fair comparisons, particularly for multi-turn agent tasks. In the revised manuscript, we will add a dedicated subsection in the Experiments section that explicitly describes the agent scaffolding, observation format, tool-use loop, maximum number of turns, and the exact prompting templates used for Kimi-VL as well as the baseline models on OSWorld and related tasks. This will directly support the reported performance claims.
  Revision: yes
- Referee: Results tables and text on LongVideoBench (64.5), MMLongBench-Doc (35.1), and MMMU-Pro (46.3): No error bars, standard deviations, number of evaluation runs, or statistical significance tests are reported. Given the strong claims of surpassing GPT-4o in key domains, this omission prevents assessment of whether differences are reliable.
  Authors: We acknowledge that the absence of variance estimates limits the ability to assess statistical reliability of the reported differences. In the revised version, we will clarify in the Results section that evaluations were performed with a single run per model (standard practice for many large-scale VLM benchmarks due to computational cost) and add a discussion of this limitation. Where multiple runs were feasible for smaller subsets, we will report them; otherwise, we will qualify the surpassing claims accordingly without overstating robustness.
  Revision: partial
- Referee: Training and evaluation details for Kimi-VL-Thinking-2506: The long CoT SFT and RL procedure is described at high level only, with no information on the composition of the long-horizon reasoning data, reward model, or decontamination steps for benchmarks such as MMMU and MathVision. These omissions directly affect interpretability of the reported gains (e.g., 64.0 on MMMU).
  Authors: We agree that greater detail on the long CoT SFT and RL training would improve interpretability of the gains for Kimi-VL-Thinking-2506. In the revised manuscript, we will expand the relevant section to include additional information on the composition of the long-horizon reasoning data, the reward model design, and the decontamination procedures applied to benchmarks such as MMMU and MathVision. This will help readers better contextualize the performance numbers.
  Revision: yes
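The decontamination steps requested in the third comment are not described on this page. As a point of reference, one widely used heuristic is to drop training samples that share a long word n-gram with any benchmark item. The sketch below illustrates that heuristic only; it is an assumption about what such a check could look like, not the authors' procedure, and `ngrams`, `is_contaminated`, and the example strings are hypothetical.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in `text` (empty if the text is shorter than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_sample, benchmark_items, n=8):
    """Flag a training sample that shares at least one n-gram with any benchmark item."""
    sample_grams = ngrams(train_sample, n)
    return any(sample_grams & ngrams(item, n) for item in benchmark_items)

# Made-up benchmark question and training samples, for illustration only.
benchmark = ["What is the area of the shaded triangle in the figure"]
overlapping = "Compute: what is the area of the shaded triangle shown here"
unrelated = "Describe the weather in this photograph of a harbor"

# A short n keeps the toy example readable; real checks typically use longer windows.
print(is_contaminated(overlapping, benchmark, n=4))
print(is_contaminated(unrelated, benchmark, n=4))
```

Text overlap of this kind cannot detect leaked images or paraphrased questions, so for multimodal benchmarks it would be at best one component of a decontamination report.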
Circularity Check
No circularity: empirical benchmark report with no derivations or self-referential predictions
Full rationale
The paper is a technical report describing the Kimi-VL model architecture (MoE VLM with MoonViT encoder), training process (SFT and RL for the Thinking variant), and performance on external benchmarks such as OSWorld, LongVideoBench, MMMU, InfoVQA, and ScreenSpot-Pro. No mathematical derivations, first-principles predictions, or fitted parameters are presented as novel results. All claims rest on reported benchmark scores compared to external models (GPT-4o, Qwen2.5-VL, etc.). There are no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to the paper's own inputs. The derivation chain is absent; the work is self-contained as an empirical evaluation report.
Axiom & Free-Parameter Ledger
invented entities (1)
- MoonViT: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  "We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model... MoonViT... joint pre-training stages... Long-CoT SFT and RL"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  "Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models... 64.5 on LongVideoBench... 83.2 on InfoVQA"
Forward citations
Cited by 59 Pith papers
- Large Language Models Lack Temporal Awareness of Medical Knowledge
  LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
- SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
  SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
  HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
- Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
  ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Count Anything at Any Granularity
  Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
- Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
  RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
  VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.
- Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning
  LLM agents reach only 50.6% accuracy on chemical cost estimation within 25% error even with tools, dropping with noise due to parsing, pack selection, and tool-use failures.
- Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning
  Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...
- RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs
  RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with stron...
- QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
  Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.
- FCMBench-Video: Benchmarking Document Video Intelligence
  FCMBench-Video is a new benchmark with 1,200 videos and 11k QA instances for evaluating Video-MLLMs on document video understanding across 28 document types.
- Can Multimodal Large Language Models Truly Understand Small Objects?
  Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
- Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
  Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
- Discrete Prototypical Memories for Federated Time Series Foundation Models
  FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.
- Token Warping Helps MLLMs Look from Nearby Viewpoints
  Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
  NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs
  SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
- Dimension-Free Saddle-Point Escape in Muon
  Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
- Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
  Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
  LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
  VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
  LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
- Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
  AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
- SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs
  SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.
- OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model
  OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.
- Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
  NPUMoE accelerates MoE LLM inference on Apple Silicon NPUs via offline-calibrated static expert tiers, grouped execution, and load-aware graph residency, delivering 1.32x-5.55x lower latency and 1.81x-7.37x better ene...
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
- AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
  AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
- POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
  POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
- POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
  POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
- Omnimodal Dataset Distillation via High-order Proxy Alignment
  HoPA captures high-order cross-modal alignments via a shared proxy to enable scalable omnimodal dataset distillation with better performance-compression trade-offs.
- AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models
  AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocati...
- Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
  Muon² adds adaptive second-moment preconditioning to Muon, improving spectrum conditioning for faster orthogonalization, outperforming Muon on GPT and LLaMA pre-training from 60M to 1.3B parameters while cutting Newto...
- Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
  Multimodal MoE models exhibit 'Seeing but Not Thinking' due to routing distraction where visual inputs fail to activate reasoning experts; a targeted intervention improves results by up to 3.17% across models and benchmarks.
- Small Vision-Language Models are Smart Compressors for Long Video Understanding
  Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
- Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
  Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
- Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
  Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
- Optimal Projection-Free Adaptive SGD for Matrix Optimization
  Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- Perceptual Flow Network for Visually Grounded Reasoning
  PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
- Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
  PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.
- Let ViT Speak: Generative Language-Image Pre-training
  GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
  A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- Class-specific diffusion models improve military object detection in a low-data domain
  Class-specific diffusion models fine-tuned on 8-24 real images per class generate synthetic data that improves military vehicle detection by up to 8% mAP50 in low-data regimes, with further gains from ControlNet edge ...
- UniMesh: Unifying 3D Mesh Understanding and Generation
  UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
- Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
  LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
  OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
- Kimi K2.5: Visual Agentic Intelligence
  Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
- Emerging Properties in Unified Multimodal Pretraining
  BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
- EasyVideoR1: Easier RL for Video Understanding
  EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.