Kimi K2: Open Agentic Intelligence

Kimi Team: Yifan Bai, Y. Charles, Yiping Bao, Cheng Chen, Guanduo Chen, Haiting Chen · 2025 · cs.LG · arXiv 2507.20534

Canonical reference. 268 Pith papers cite this work; 77% of classified citations cite it as background.

abstract

We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual -- surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.

citation-role summary

background 47 · baseline 6 · method 4 · dataset 2 · other 1

citation-polarity summary

background 46 · baseline 6 · use method 4 · unclear 2 · use dataset 2
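
The 77% background figure quoted on this page appears to follow from the polarity counts above: 46 of the 60 classified citations. That reading is an inference from the numbers shown here, not a documented Pith formula, and the names `polarity_counts` and `share_percent` are hypothetical. A minimal Python sketch of the arithmetic under that assumption:

```python
# Counts copied from the citation-polarity summary on this page.
# Assumption: the headline percentage is background / all classified citations.
polarity_counts = {
    "background": 46,
    "baseline": 6,
    "use method": 4,
    "unclear": 2,
    "use dataset": 2,
}


def share_percent(counts, label):
    """Return the share of `label` among all counted citations, in percent."""
    total = sum(counts.values())
    return 100.0 * counts[label] / total


total_classified = sum(polarity_counts.values())
background_share = share_percent(polarity_counts, "background")

# Prints: 60 classified citations, background share 77%
print(f"{total_classified} classified citations, background share {background_share:.0f}%")
```

The role summary counts 47 background citations out of the same 60, which would round to 78% instead, so the headline figure seems to be derived from the polarity counts. Only 60 of the 268 citing papers are counted in either summary.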

representative citing papers

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

cs.AI · 2026-06-02 · unverdicted · novelty 8.0

MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

LLM Translation of Compiler Intermediate Representation

cs.PL · 2026-05-07 · unverdicted · novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification

cs.CL · 2026-04-11 · conditional · novelty 8.0

Introduces the ODUTQA-MDC task with a 25k-pair benchmark and MAIC-TQA multi-agent framework for detecting and clarifying underspecified open-domain tabular questions via dialogue.

Evidence-State Rewards for Long-Context Reasoning

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Static SFT and RL training for tool-use agents leads to performance drops under open-world distributional shifts across perception, interaction, reasoning and internalization; perturbation-augmented fine-tuning is proposed as mitigation.

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.

Vibe Calibration: Autonomous Bring-up of a 112-Qubit Superconducting Quantum Processor by a Skill-Orchestrating Language Agent

quant-ph · 2026-06-21 · unverdicted · novelty 7.0

Vibe Calibration uses LLM agents to orchestrate reusable decision-tree Skills distilled from expert knowledge, autonomously calibrating 108/112 qubits in 4.7 hours with 4-5x speedup and transferable workflows.

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

cs.CL · 2026-06-20 · unverdicted · novelty 7.0

BioMatrix unifies sequences, structures, and language for molecules and proteins inside one decoder-only foundation model via shared discrete tokens and achieves SOTA or competitive results on 77 of 80 downstream tasks.

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

cs.AI · 2026-06-19 · unverdicted · novelty 7.0

Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.

Towards Understanding the Power and Limits of the Muon Optimizer: A River-Valley Perspective

cs.LG · 2026-06-19 · unverdicted · novelty 7.0

Muon moves faster along signal river directions early but converges slower or oscillates near optima than GD due to orthogonal updates removing scale information, supporting two-stage optimization.

GraphPO: Graph-based Policy Optimization for Reasoning Models

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

GraphPO represents reasoning rollouts as a DAG to merge semantically equivalent paths, share suffixes, and assign separate efficiency and correctness advantages for lower variance and better performance than chain or tree baselines.

Harnessing Routing Foresight for Micro-step-level MoE load balancing in RL Post-training

cs.DC · 2026-06-10 · unverdicted · novelty 7.0

ForeMoE uses routing foresight from the rollout stage to enable micro-step load balancing in MoE RL post-training via a hierarchical planner and transfer engine, claiming up to 1.45x speedup on 64 GPUs.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Introduces LongJudgeBench benchmark showing LLM judges remain unstable for long-form output evaluation even with rubrics or references.

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

cs.DC · 2026-05-31 · unverdicted · novelty 7.0

On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

MM-Snowball benchmark diagnoses hallucination snowballing in multi-turn MLLM dialogues; CAVR mitigates it via dual visual rectification at representation and logit levels.

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

MineExplorer is a new benchmark for MLLM agents' open-world exploration in Minecraft, using task filtering, ReAct formulation, and multi-agent synthesis to create reliable multi-hop instances.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

citing papers explorer

Showing 49 of 49 citing papers after filters.

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
cs.AI · 2026-06-02 · unverdicted · none · ref 27 · internal anchor
MedCUA-Bench provides 18 clinical scenarios in 10 domains as a testbed for computer-use agents on medical UIs, with evaluations of 23 agents showing low success rates especially on real systems like OpenEMR.

Evidence-State Rewards for Long-Context Reasoning
cs.AI · 2026-07-02 · unverdicted · none · ref 3 · internal anchor
Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use
cs.AI · 2026-07-01 · unverdicted · none · ref 51 · internal anchor
Static SFT and RL training for tool-use agents leads to performance drops under open-world distributional shifts across perception, interaction, reasoning and internalization; perturbation-augmented fine-tuning is proposed as mitigation.

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents
cs.AI · 2026-06-22 · unverdicted · none · ref 13 · internal anchor
CLI-Universe synthesizes a verified 6K dataset of terminal-agent tasks that, when used to fine-tune Qwen3-32B, reaches 33.4% on Terminal-Bench 2.0 and sets a new open-source SOTA for models at or below 32B parameters.

Counsel: A Meta-Evaluation Dataset for Agentic Tasks
cs.AI · 2026-06-19 · unverdicted · none · ref 6 · internal anchor
Counsel is a new dataset of LLM-generated process critiques on agent benchmarks paired with human labels on error location and reasoning quality, achieving 0.78 Krippendorff alpha.

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
cs.AI · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
cs.AI · 2026-04-16 · unverdicted · none · ref 9 · internal anchor
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.

Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation
cs.AI · 2026-04-12 · unverdicted · none · ref 5 · internal anchor
ResistClient creates more realistic challenging client simulators by combining resistance theory with supervised fine-tuning on a new dataset followed by process-supervised reinforcement learning for motivation reasoning.

Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
cs.AI · 2026-04-10 · unverdicted · none · ref 1 · internal anchor
A mid-sized LLM buyer trained with RL from verifiable economic rewards learns sophisticated negotiation tactics and extracts more surplus than frontier models over 10x larger.

SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI · 2026-04-10 · unverdicted · none · ref 48 · internal anchor
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling
cs.AI · 2026-04-09 · unverdicted · none · ref 6 · 2 links · internal anchor
Plan-RewardBench is a trajectory-level preference benchmark that evaluates how well reward models distinguish preferred agent trajectories from hard distractors across safety refusal, tool handling, complex planning, and error recovery tasks.

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
cs.AI · 2026-04-09 · unverdicted · none · ref 13 · internal anchor
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.

PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent
cs.AI · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
cs.AI · 2026-04-03 · unverdicted · none · ref 24 · internal anchor
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

Evaluating the Search Agent in a Parallel World
cs.AI · 2026-03-05 · unverdicted · none · ref 19 · internal anchor
Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping decisions.

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
cs.AI · 2025-12-27 · accept · none · ref 3 · internal anchor
TravelBench is a new benchmark with three subtasks and ten cached real-world tools to evaluate LLM agents on realistic multi-turn travel planning and capability boundaries.

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
cs.AI · 2025-12-23 · unverdicted · none · ref 20 · internal anchor
A new benchmark of 40 scenarios finds state-of-the-art LLMs exhibit outcome-driven constraint violations in 0-62.8% of cases under KPI pressure, with no consistent safety gains across model generations.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
cs.AI · 2025-11-05 · unverdicted · none · ref 12 · internal anchor
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

Safety Testing LLM Agents at Scale: From Risk Discovery to Evidence-Grounded Verification
cs.AI · 2026-07-02 · conditional · none · ref 44 · internal anchor
Vera automates safety testing for LLM agents via literature-driven risk taxonomies, combinatorial case generation, and evidence-grounded verification in isolated environments, showing 93.9% average attack success on four frameworks.

ACE: Pluggable Adaptive Context Elasticizer across Agents
cs.AI · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
cs.AI · 2026-06-10 · unverdicted · none · ref 27 · internal anchor
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents
cs.AI · 2026-06-07 · unverdicted · none · ref 1 · internal anchor
VESTA creates 1,072 automated safety scenarios across five risk dimensions and reports an average 47.1% attack success rate on 12 LLM agents under two authority settings.

Benchmark Everything Everywhere All at Once
cs.AI · 2026-06-04 · unverdicted · none · ref 18 · internal anchor
Benchmark Agent is an autonomous agentic system that constructs benchmarks for LLMs and MLLMs via query analysis, subtask design, annotation and quality control, yielding 15 benchmarks with minimal human input.

Reasoning Structure of Large Language Models
cs.AI · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
Introduces a logic puzzle benchmark, a pipeline to build verifiable reasoning graphs from traces, and a concentration-based efficiency metric to distinguish model reasoning behaviors.

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision
cs.AI · 2026-05-31 · unverdicted · none · ref 15 · internal anchor
SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
cs.AI · 2026-05-28 · unverdicted · none · ref 25 · internal anchor
GDCR assigns step-level rewards via distance to the answer node in a training-time ER graph and SAPO combines these with trajectory advantages for credit assignment in agentic search.

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF
cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
DistractionIF benchmark reveals inverse scaling in LLM robustness to distractors in reference text, with GRPO RL as a mitigation.

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
cs.AI · 2026-05-27 · unverdicted · none · ref 2 · internal anchor
AsyncTool is a new benchmark for evaluating LLM agents' asynchronous tool calling in multi-task scenarios with simulated response latency and efficiency metrics.

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents
cs.AI · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
DynaSchedBench calibrates DFJSP benchmarks via SESC and SSI, revealing an observability paradox and limited gains from LLM agents over heuristics in dynamic scheduling.

Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching
cs.AI · 2026-05-25 · unverdicted · none · ref 17 · internal anchor
DecoR routes LLM queries by decomposing them into capability dimensions and matching to historical examples, yielding higher accuracy and lower inference costs than direct-mapping routers on both in-distribution and OOD data.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
cs.AI · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

Selective Off-Policy Reference Tuning with Plan Guidance
cs.AI · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI · 2026-05-11 · unverdicted · none · ref 32 · 2 links · internal anchor
ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

ZAYA1-8B Technical Report
cs.AI · 2026-05-06 · unverdicted · none · ref 136 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
cs.AI · 2026-05-03 · unverdicted · none · ref 17 · internal anchor
PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
cs.AI · 2026-04-20 · unverdicted · none · ref 50 · internal anchor
LLM agents trained with a task-success reward on self-generated knowledge can spontaneously explore and adapt to new environments without any rewards or instructions at inference, yielding 20% gains on web tasks and allowing a 14B model to beat Gemini-2.5-Flash.

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
cs.AI · 2026-04-20 · conditional · none · ref 60 · internal anchor
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards
cs.AI · 2026-04-11 · unverdicted · none · ref 18 · internal anchor
Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.

Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation
cs.AI · 2026-03-18 · unverdicted · none · ref 4 · internal anchor
Safety degradation in large reasoning models occurs only after chain-of-thought is enabled; adding pre-CoT safety signals from a BERT classifier on safe models improves safety while preserving reasoning ability.

No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
cs.AI · 2026-01-11 · unverdicted · none · ref 14 · internal anchor
ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
cs.AI · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
NoisyAgent trains LLM agents with controlled user and tool noise to improve robustness in stochastic environments while also boosting clean-benchmark performance.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
cs.AI · 2026-05-04 · unverdicted · none · ref 39 · internal anchor
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models
cs.AI · 2026-04-18 · unverdicted · none · ref 22 · internal anchor
MCPO fixes vanishing training signals and shrinking weights in GRPO by using a hinge-KL regularizer on mastered prompts and prioritizing majority-correct prompts, yielding higher pass@1 and pass@k on math tasks.

Mind DeepResearch Technical Report
cs.AI · 2026-04-16 · unverdicted · none · ref 33 · internal anchor
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI · 2025-09-02 · conditional · none · ref 65 · internal anchor
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch
cs.AI · 2026-06-17 · unverdicted · none · ref 32 · internal anchor
ProfiLLM deploys tool-augmented LLM agents to generate reusable global knowledge and utility-selected user profiles, delivering up to 6.14% AUC lift and measurable GMV gains in DiDi's live dispatcher.

MAVEN: Improving Generalization in Agentic Tool Calling
cs.AI · 2026-05-29 · unverdicted · none · ref 13 · internal anchor
MAVEN is a modular verification scaffold that lifts an open 120b model's tool-calling accuracy from 48% to 71% on MAVEN-Bench without retraining.

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
cs.AI · 2026-05-07 · unverdicted · none · ref 45 · 2 links · internal anchor
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.

Agentic Reasoning for Large Language Models
cs.AI · 2026-01-18 · unverdicted · none · ref 243 · internal anchor
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
