

Kimi K2: Open Agentic Intelligence

116 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique that addresses training instability while retaining Muon's token efficiency. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spikes. During post-training, K2 undergoes a multi-stage process highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, in which the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with particular strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual, surpassing most open- and closed-source baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning, scoring 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
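The abstract names QK-clip only at a high level. As a rough sketch of the idea it describes, the snippet below rescales the query and key projection weights whenever the largest pre-softmax attention logit observed in a step exceeds a threshold, bounding logit growth without altering the Muon update itself. The threshold value, the in-place weight update, and the even square-root split of the shrink factor between the query and key projections are illustrative assumptions, not details stated in the abstract.

```python
import torch

@torch.no_grad()
def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
             max_logit: float, tau: float = 100.0) -> None:
    """Shrink query/key projection weights in place when the largest
    pre-softmax attention logit seen this step exceeds tau (assumed value)."""
    if max_logit > tau:
        gamma = tau / max_logit   # shrink factor, strictly below 1
        scale = gamma ** 0.5      # split the shrink evenly across Q and K
        w_q.mul_(scale)
        w_k.mul_(scale)
```

Applied after each optimizer step, a clip like this caps the attention logits multiplicatively while leaving steps whose logits are already below the threshold untouched.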

hub tools

citation-role summary: background 2

citation-polarity summary: still indexing


representative citing papers

LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

cs.IR · 2026-05-13 · conditional · novelty 7.0

LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% by pairing an embedding-and-rerank retriever with an iterative sketch-retrieve-reflect loop (sketched below).
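The entry names the sketch-retrieve-reflect loop without specifying it. Purely as a hypothetical illustration of how such a loop could be wired (every function name below is invented for this sketch; none comes from the LeanSearch v2 paper), the control flow might look like:

```python
from typing import Callable, Optional

def sketch_retrieve_reflect(
    theorem: str,
    draft_sketch: Callable[[str, str], str],             # LLM: (theorem, feedback) -> proof sketch
    retrieve: Callable[[str], list[str]],                # embed + rerank: sketch -> premise list
    attempt: Callable[[str, list[str]], Optional[str]],  # prover: error message, or None on success
    max_rounds: int = 3,
) -> bool:
    """Draft a proof sketch, retrieve premises conditioned on it, try the
    proof, and feed any failure back into the next sketch."""
    feedback = ""
    for _ in range(max_rounds):
        sketch = draft_sketch(theorem, feedback)
        premises = retrieve(sketch)
        error = attempt(theorem, premises)
        if error is None:
            return True
        feedback = f"previous sketch:\n{sketch}\nprover error:\n{error}"
    return False
```

Per the summary above, the gain over a fixed retrieval loop comes from conditioning retrieval on the evolving sketch rather than on the bare theorem statement.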

citing papers explorer


  • On the Creativity of AI Agents · cs.CY · 2026-04-14 · unverdicted · none · ref 54 · internal anchor

    LLM agents produce outputs that meet basic functional criteria for creativity but lack the process-level, social, and personal elements required for ontological creativity.