pith. machine review for the scientific record.

arxiv: 2507.20534 · v2 · submitted 2025-07-28 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Kimi K2: Open Agentic Intelligence

Kimi Team: Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen,
Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Qizheng Gu, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yang Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T.Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Haoyu Lu, Lijun Lu, Yashuo Luo, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Zeyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Lin Sui, Xinjie Sun, Flood Sung, Yunpeng Tai, Heyi Tang, Jiawen Tao, Qifeng Teng, Chaoran Tian, Chensi Wang, Dinglu Wang, Feng Wang, Hailong Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Si Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Haoning Wu, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jinjing Xu, L.H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Jing Xu, Junjie Yan, Yuzi Yan, Hao Yang, Xiaofei Yang, Yi Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Siyu Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Shaojie Zheng, Longguang Zhong, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Zhen Zhu, Weiyu Zhuang, Xinxing Zu
Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords large language models · mixture of experts · agentic intelligence · reinforcement learning · software engineering · pre-training stability · open-source models · post-training pipeline

The pith

Kimi K2 is a 1-trillion-parameter open MoE model that leads non-thinking models on agentic and software engineering benchmarks through stable pre-training and environment-based post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Kimi K2 as a Mixture-of-Experts language model built to advance agentic intelligence in open-source settings. It describes pre-training on 15.5 trillion tokens with a custom optimizer that avoids loss spikes, followed by multi-stage post-training that synthesizes agentic data and applies joint reinforcement learning through interactions with real and synthetic environments. If the approach works as described, open models could deliver competitive performance in practical tasks like autonomous coding and reasoning without extended thinking steps. A sympathetic reader would care because the base and post-trained checkpoints are released for others to use and build upon.

Core claim

Kimi K2 is a Mixture-of-Experts large language model with 32 billion activated parameters and 1 trillion total parameters. It reaches state-of-the-art results among open-source non-thinking models on agentic tasks by pre-training on 15.5 trillion tokens using the MuonClip optimizer with zero loss spikes and then applying a multi-stage post-training process that includes large-scale agentic data synthesis and joint reinforcement learning with environments. This yields strong performance in coding, mathematics, and reasoning without extended thinking.
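
The activated-versus-total gap is a property of sparse expert routing: each token is dispatched to only a few experts, so per-token compute scales with the experts actually selected rather than with the full expert pool. Below is a minimal top-k routing sketch; the expert count, hidden sizes, and routing rule are placeholder assumptions for illustration, not K2's actual configuration.

    # Minimal top-k mixture-of-experts layer (illustrative placeholder sizes,
    # not K2's architecture): total parameters grow with the number of experts,
    # but each token only runs through the k experts its router selects.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        def __init__(self, d_model=1024, d_ff=4096, n_experts=64, k=4):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                               # x: (tokens, d_model)
            gate_logits = self.router(x)                    # (tokens, n_experts)
            weights, idx = gate_logits.topk(self.k, dim=-1) # route each token to k experts
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e in idx[:, slot].unique().tolist():    # run only the experts chosen in this slot
                    mask = idx[:, slot] == e
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
            return out

    layer = TopKMoE()
    total = sum(p.numel() for p in layer.parameters())
    active = layer.router.weight.numel() + layer.k * sum(
        p.numel() for p in layer.experts[0].parameters())
    print(f"total params: {total:,}  ~activated per token: {active:,}")

At the reported scale, the 32B-of-1T split means roughly 3% of the parameters participate in any single token's forward pass.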

What carries the argument

The MuonClip optimizer, which adds a QK-clip technique to Muon for training stability while preserving token efficiency, together with the multi-stage post-training pipeline that combines agentic data synthesis and joint reinforcement learning through real and synthetic environment interactions.
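
The abstract describes QK-clip only as a fix for training instability. One plausible reading, sketched below, is that it rescales a head's query/key projection weights whenever that head's maximum observed attention logit exceeds a threshold; the trigger statistic, the threshold value, and the even split between the query and key sides are assumptions of this sketch, not the authors' reported settings.

    # Hedged sketch of a QK-clip style intervention (a plausible reading of the
    # mechanism, not the authors' implementation). After an optimizer step, a
    # head whose maximum attention logit exceeded tau has its W_q / W_k scaled
    # down so the same inputs would have produced a logit of about tau.
    import torch

    def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor,
                 max_logit: float, tau: float = 100.0, alpha: float = 0.5) -> None:
        """Rescale one head's projections in place; tau and alpha are assumed values."""
        gamma = min(1.0, tau / max_logit)
        if gamma < 1.0:
            w_q.mul_(gamma ** alpha)        # attention logits are bilinear in W_q and W_k,
            w_k.mul_(gamma ** (1 - alpha))  # so splitting the factor caps them near tau

    # Usage inside a training loop (per attention head h):
    #   optimizer.step()
    #   qk_clip_(W_q[h], W_k[h], max_logit=running_max_logit[h])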

If this is right

  • Open-source models can reach high levels of performance in software engineering and agentic tasks without closed-source resources or extended reasoning.
  • The release of both base and post-trained checkpoints allows the community to continue research on agentic intelligence.
  • Large-scale pre-training on trillions of tokens can proceed stably using the described optimizer technique.
  • Agentic capabilities strengthen when models interact with environments during reinforcement learning stages.
  • Strong results in multilingual coding benchmarks follow from the same training process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Releasing the model weights may let independent groups combine or fine-tune Kimi K2 for new agentic applications faster than single-lab development allows.
  • If the stable training method works at this scale, it could simplify hyperparameter choices for future trillion-parameter pre-training runs.
  • Success on multilingual software benchmarks points to a route for building coding tools that work across languages without separate models for each.

Load-bearing premise

The reported benchmark scores reflect genuine generalizable agentic and coding ability rather than optimization specific to those test sets.

What would settle it

Running the released checkpoints on new, previously unseen agentic and coding benchmarks to check whether performance remains at the claimed leading level among open non-thinking models.
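
One way such a check could be wired up is sketched below: serve the released weights behind an OpenAI-compatible endpoint (for example via vLLM) and score them on a locally held benchmark that post-dates the model. The endpoint URL, served model name, task-file format, and exact-match scoring are all assumptions of this sketch, not the paper's evaluation protocol.

    # Hedged verification harness: query the released checkpoint through an
    # OpenAI-compatible server and score a held-out task file. Endpoint, model
    # name, file format, and metric are assumptions, not the paper's protocol.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local vLLM server
    MODEL = "moonshotai/Kimi-K2-Instruct"  # assumed identifier for the released checkpoint

    def solve(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,     # deterministic decoding for repeatability
            max_tokens=2048,
        )
        return resp.choices[0].message.content.strip()

    def run(path: str) -> float:
        tasks = [json.loads(line) for line in open(path)]   # lines of {"prompt": ..., "answer": ...}
        hits = sum(solve(t["prompt"]) == t["answer"].strip() for t in tasks)
        return hits / len(tasks)

    if __name__ == "__main__":
        print(f"held-out accuracy: {run('heldout_tasks.jsonl'):.3f}")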

read the original abstract

We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual -- surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 4 minor

Summary. The paper introduces Kimi K2, a Mixture-of-Experts LLM with 32B activated parameters and 1T total parameters. It describes the MuonClip optimizer (an extension of Muon with QK-clip for stability), pre-training on 15.5 trillion tokens with zero loss spikes, and multi-stage post-training involving large-scale agentic data synthesis and joint RL with real/synthetic environments. The model is reported to achieve SOTA results among open-source non-thinking models on agentic and coding benchmarks (Tau2-Bench 66.1, ACEBench En 76.5, SWE-Bench Verified 65.8, SWE-Bench Multilingual 47.3, LiveCodeBench v6 53.7, AIME 2025 49.5, GPQA-Diamond 75.1, OJBench 27.1) without extended thinking, with both base and post-trained checkpoints released.

Significance. If the empirical results hold under independent verification, the work is significant for releasing a competitive open-source model strong in agentic, software engineering, and reasoning tasks, closing some of the gap with closed models in non-thinking settings. The model release directly enables reproducibility and further research on agentic intelligence. The MuonClip optimizer is presented as a practical contribution for stable large-scale training, though its isolated impact requires more evidence.

major comments (2)
  1. [Pre-training description] Pre-training section: The assertion of training on 15.5 trillion tokens with 'zero loss spike' using MuonClip is stated without any loss curves, stability metrics, or ablation comparisons to baseline optimizers (e.g., AdamW or standard Muon). This detail is load-bearing for claims about the optimizer's effectiveness and the overall training narrative, even if final benchmark scores are the primary result.
  2. [Results and benchmarks] Evaluation and results sections: Reported benchmark scores (e.g., Tau2-Bench 66.1, SWE-Bench Verified 65.8) lack accompanying details on exact evaluation protocols, prompting formats, temperature settings, or error bars from multiple runs. While model release allows verification of the numbers themselves, the absence of these elements weakens assessment of robustness and generalizability versus potential test-set optimization.

minor comments (4)
  1. [Abstract and §1] The abstract and introduction would benefit from a clearer statement of the total vs. activated parameter count and how the MoE architecture is configured (e.g., number of experts, routing details).
  2. [Post-training] Post-training description mentions 'joint reinforcement learning (RL) stage' but provides no specifics on the RL algorithm, reward model, or environment interaction details, which would aid reproducibility.
  3. [Tables and figures] Figure and table captions could be expanded to include exact benchmark versions, baselines compared, and whether results are from the base or post-trained model.
  4. [Discussion or new section] The paper should include a limitations section addressing potential data contamination risks for the reported benchmarks, given the scale of pre-training data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the two major comments point by point below, agreeing that additional transparency will strengthen the manuscript. We will incorporate the requested details in the revised version.

read point-by-point responses
  1. Referee: [Pre-training description] Pre-training section: The assertion of training on 15.5 trillion tokens with 'zero loss spike' using MuonClip is stated without any loss curves, stability metrics, or ablation comparisons to baseline optimizers (e.g., AdamW or standard Muon). This detail is load-bearing for claims about the optimizer's effectiveness and the overall training narrative, even if final benchmark scores are the primary result.

    Authors: We agree that the pre-training stability claim would benefit from supporting evidence. Full ablations at 1T-parameter scale are computationally prohibitive and were not performed, but we will add a pre-training loss curve figure in the revised manuscript (or appendix) to demonstrate the absence of spikes across the 15.5T tokens. We will also briefly describe the QK-clip mechanism and its observed effect on gradient norms during development runs to provide context for the optimizer's contribution. revision: yes

  2. Referee: [Results and benchmarks] Evaluation and results sections: Reported benchmark scores (e.g., Tau2-Bench 66.1, SWE-Bench Verified 65.8) lack accompanying details on exact evaluation protocols, prompting formats, temperature settings, or error bars from multiple runs. While model release allows verification of the numbers themselves, the absence of these elements weakens assessment of robustness and generalizability versus potential test-set optimization.

    Authors: We accept this point and will expand the evaluation section in the revision. We will include a table or subsection specifying prompting formats, temperature (typically 0.0 for deterministic agentic/coding benchmarks), top-p, and other sampling parameters for each reported score. We will also note that the results reflect single runs on standard benchmarks and that the open release of both base and post-trained checkpoints enables independent multi-run verification and statistical analysis by the community. revision: yes
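
What the promised revision could amount to, concretely, is sketched below: a per-benchmark decoding table plus mean-and-spread reporting over repeated runs. The benchmark names come from the abstract; every parameter value and the reporting format are placeholder assumptions, not the authors' actual protocol or measurements.

    # Sketch of the per-benchmark decoding table and multi-run statistics the
    # referee asks for. Benchmark names are from the abstract; all values here
    # are placeholders, not the authors' settings or results.
    from statistics import mean, stdev

    EVAL_CONFIG = {
        "SWE-Bench Verified": {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 8192},
        "Tau2-Bench":         {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 4096},
        "LiveCodeBench v6":   {"temperature": 0.0, "top_p": 1.0,  "max_tokens": 4096},
        "AIME 2025":          {"temperature": 0.6, "top_p": 0.95, "max_tokens": 4096},
    }

    def report(scores_by_seed: dict[str, list[float]]) -> None:
        """Print mean +/- std over repeated runs: the error bars the current report lacks."""
        for bench, scores in scores_by_seed.items():
            spread = stdev(scores) if len(scores) > 1 else 0.0
            print(f"{bench:20s} {mean(scores):5.1f} +/- {spread:.1f}  (n={len(scores)})")

    # usage: report({"SWE-Bench Verified": [s1, s2, s3], ...}) with scores from independent runs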

Circularity Check

0 steps flagged

No significant circularity; empirical claims only

full rationale

The paper's central claims are direct empirical benchmark scores (e.g., 66.1 on Tau2-Bench, 65.8 on SWE-Bench Verified) obtained by running the released model checkpoints under standard evaluation protocols. The MuonClip optimizer is introduced as a practical training technique with a QK-clip modification, but no derivation, equation, or prediction reduces by construction to fitted parameters, self-referential normalizations, or prior self-citations. Training statements such as 'zero loss spike' on 15.5 trillion tokens are factual process descriptions, not outputs derived from the model's own equations or ansatzes. No uniqueness theorems, load-bearing self-citations, or renamed known results are invoked to support the performance claims, which remain independently verifiable by third parties using the released weights.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical training runs and benchmark evaluations of a new model; no new mathematical axioms, free parameters fitted to the target result, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 6409 in / 1097 out tokens · 58350 ms · 2026-05-10T17:44:12.589435+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

    cs.AR 2026-05 conditional novelty 8.0

    Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

  2. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  3. MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

    cs.LG 2026-05 unverdicted novelty 8.0

    MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.

  4. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  5. LLM Translation of Compiler Intermediate Representation

    cs.PL 2026-05 unverdicted novelty 8.0

    IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.

  6. HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

    cs.CR 2026-04 unverdicted novelty 8.0

    Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

  7. ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification

    cs.CL 2026-04 conditional novelty 8.0

    Introduces the ODUTQA-MDC task with a 25k-pair benchmark and MAIC-TQA multi-agent framework for detecting and clarifying underspecified open-domain tabular questions via dialogue.

  8. GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

    cs.CY 2026-05 unverdicted novelty 7.0

    A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.

  9. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.

  10. LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving

    cs.IR 2026-05 conditional novelty 7.0

    LeanSearch v2 recovers 46.1% of ground-truth premise groups on research-level Mathlib theorems and raises fixed-loop proof success from 4% to 20% via embedding-reranker plus iterative sketch-retrieve-reflect retrieval.

  11. CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

    cs.CV 2026-05 unverdicted novelty 7.0

    CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...

  12. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  13. Beyond Position Bias: Shifting Context Compression from Position-Driven to Semantic-Driven

    cs.CL 2026-05 unverdicted novelty 7.0

    SeCo performs semantic-driven context compression for LLMs by anchoring on query-relevant semantic centers and applying consistency-weighted token merging, yielding better downstream performance, lower latency, and st...

  14. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  15. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

    CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  16. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 conditional novelty 7.0

    Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.

  17. When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

    cs.PF 2026-05 unverdicted novelty 7.0

    Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...

  18. TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.

  19. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench is a new benchmark with 27 tasks in four cognitive categories that evaluates six MLLMs on dental radiographs and shows clear performance gaps versus clinicians.

  20. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice

    cs.CL 2026-05 unverdicted novelty 7.0

    OralMLLM-Bench reveals performance gaps between multimodal large language models and clinicians on cognitive tasks for dental radiographic analysis across periapical, panoramic, and cephalometric images.

  21. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  22. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  23. FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

    cs.DC 2026-04 unverdicted novelty 7.0

    FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.

  24. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

    cs.CL 2026-04 conditional novelty 7.0

    GTA-2 benchmark shows frontier models achieve below 50% on atomic tool tasks and only 14.39% success on realistic long-horizon workflows, with execution harnesses like Manus providing substantial gains.

  25. TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

    cs.AI 2026-04 unverdicted novelty 7.0

    TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.

  26. AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

    cs.IR 2026-04 unverdicted novelty 7.0

    A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

  27. E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning

    cs.SE 2026-04 unverdicted novelty 7.0

    E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.

  28. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  29. Beyond Compliance: A Resistance-Informed Motivation Reasoning Framework for Challenging Psychological Client Simulation

    cs.AI 2026-04 unverdicted novelty 7.0

    ResistClient creates more realistic challenging client simulators by combining resistance theory with supervised fine-tuning on a new dataset followed by process-supervised reinforcement learning for motivation reasoning.

  30. Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards

    cs.AI 2026-04 unverdicted novelty 7.0

    A mid-sized LLM buyer trained with RL from verifiable economic rewards learns sophisticated negotiation tactics and extracts more surplus than frontier models over 10x larger.

  31. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  32. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  33. Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.

  34. Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

    cs.AI 2026-04 unverdicted novelty 7.0

    Plan-RewardBench is a trajectory-level preference benchmark that evaluates how well reward models distinguish preferred agent trajectories from hard distractors across safety refusal, tool handling, complex planning, ...

  35. Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

    cs.AI 2026-04 unverdicted novelty 7.0

    Plan-RewardBench is a trajectory-level preference benchmark that shows existing reward models, including LLM judges, perform poorly on long-horizon agent trajectories in tool-using scenarios across safety, planning, a...

  36. An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

    cs.AI 2026-04 unverdicted novelty 7.0

    An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...

  37. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  38. BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    BOSCH decomposes attention-head selection for short-context hybridization into layer probing, adaptive ratio assignment, and grouped binary optimization, yielding better efficiency-performance tradeoffs than static or...

  39. DeonticBench: A Benchmark for Reasoning over Rules

    cs.CL 2026-04 unverdicted novelty 7.0

    DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

  40. AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.

  41. Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    DEMASK adds a lightweight pairwise-dependency predictor to dLLMs and uses greedy selection to enable parallel unmasking whose total-variation error is provably bounded under sub-additivity.

  42. MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

    cs.CL 2026-04 conditional novelty 7.0

    Math-PT provides 1,729 native Portuguese math problems and shows frontier LLMs perform well on multiple-choice but drop on figures and open-ended items.

  43. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  44. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  45. Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

    cs.CL 2026-05 unverdicted novelty 6.0

    Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.

  46. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

  47. EMO: Frustratingly Easy Progressive Training of Extendable MoE

    cs.LG 2026-05 unverdicted novelty 6.0

    EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock efficiency.

  48. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  49. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  50. Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    TABOM models inference unmasking preferences as a Boltzmann distribution over predictive entropies and derives a ranking loss to align DLM training with observed trajectories, yielding gains in new domains and reduced...

  51. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

  52. Likelihood scoring for continuations of mathematical text: a self-supervised benchmark with tests for shortcut vulnerabilities

    cs.LG 2026-05 unverdicted novelty 6.0

    A new benchmark uses separate predictor and scorer LLMs to test whether forecast strings improve likelihood of hidden mathematical equation continuations, with controls that detect priming shortcuts.

  53. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  54. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  55. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  56. SecureForge: Finding and Preventing Vulnerabilities in LLM-Generated Code via Prompt Optimization

    cs.CR 2026-05 unverdicted novelty 6.0

    SecureForge audits LLM code for vulnerabilities, builds a synthetic prompt corpus via Markovian sampling, and optimizes system prompts to cut security issues by up to 48% while preserving unit test performance, with z...

  57. OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

    cs.LG 2026-05 unverdicted novelty 6.0

    OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...

  58. WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation

    cs.CR 2026-05 unverdicted novelty 6.0

    WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.

  59. HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

    cs.LG 2026-05 unverdicted novelty 6.0

    HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

  60. Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

    cs.CV 2026-05 conditional novelty 6.0

    Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 121 Pith papers · 32 internal anchors
