Kimi K2.5: Visual Agentic Intelligence
Pith reviewed 2026-05-10 16:02 UTC · model grok-4.3
The pith
Kimi K2.5 reaches state-of-the-art results in coding, vision, reasoning, and agentic tasks by jointly optimizing text and vision, then adding a parallel Agent Swarm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi K2.5 shows that joint optimization of text and vision, carried through pre-training, zero-vision SFT, and reinforcement learning, combined with the Agent Swarm orchestration framework for dynamically decomposing complex tasks and executing heterogeneous sub-tasks concurrently, produces state-of-the-art performance across coding, vision, reasoning, and agentic domains while cutting latency by up to 4.5× relative to single-agent baselines.
What carries the argument
Agent Swarm, a self-directed parallel agent orchestration framework that decomposes complex tasks into heterogeneous sub-problems and executes them concurrently.
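If the sub-problems produced by that decomposition are genuinely independent, the latency story follows from elementary concurrency: wall time drops from the sum of the sub-task durations to roughly the slowest one. A minimal sketch of the pattern in Python; the sub-task names, durations, and asyncio framing are our illustrative assumptions, not the paper's implementation:

```python
import asyncio
import time

# Hypothetical sub-task: stands in for one agent's model/tool call.
async def run_subtask(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # simulated work
    return f"{name}: done"

# Swarm-style execution: all sub-tasks in flight at once, so total
# wall time is about the duration of the slowest sub-task.
async def swarm(subtasks: dict[str, float]) -> list[str]:
    return await asyncio.gather(*(run_subtask(n, s) for n, s in subtasks.items()))

# Single-agent baseline: sub-tasks run back to back, so total wall
# time is about the sum of all durations.
async def sequential(subtasks: dict[str, float]) -> list[str]:
    return [await run_subtask(n, s) for n, s in subtasks.items()]

async def main() -> None:
    # A task decomposed into heterogeneous sub-problems (durations in seconds).
    subtasks = {"parse_screenshot": 2.0, "draft_patch": 3.0, "run_tests": 1.5}
    for label, runner in (("sequential", sequential), ("swarm", swarm)):
        start = time.perf_counter()
        await runner(subtasks)
        print(f"{label}: {time.perf_counter() - start:.1f}s")

asyncio.run(main())
```

On these made-up durations the swarm path finishes in about 3.0 s against 6.5 s sequentially, a 2.2× gain; the reported 4.5× would require decompositions in which no single sub-task dominates the total.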
If this is right
- Complex multimodal tasks can be solved more efficiently by decomposing them into parallel heterogeneous sub-problems rather than sequential single-agent processing.
- The open-source release of the post-trained checkpoint allows direct experimentation and extension by the research community.
- Joint text-vision training enables each modality to improve the other, supporting stronger performance on tasks that require both seeing and reasoning.
- Latency reductions of this magnitude make real-time agentic applications more feasible without sacrificing capability.
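A back-of-envelope latency model makes the last bullet concrete; this framing (n independent sub-tasks with durations t_1, ..., t_n and orchestration overhead c) is ours, not the paper's:

```latex
% Illustrative only: t_i and c are assumed symbols, not reported values.
T_{\mathrm{seq}} = \sum_{i=1}^{n} t_i,
\qquad
T_{\mathrm{swarm}} = \max_i t_i + c,
\qquad
\frac{T_{\mathrm{seq}}}{T_{\mathrm{swarm}}} \le \frac{\sum_i t_i}{\max_i t_i} \le n.
```

Nine sub-tasks of 1 s each with c = 1 s would give exactly 9/2 = 4.5×, so a reduction of that size is plausible only when tasks decompose into many comparably sized, independent pieces.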
Where Pith is reading between the lines
- The swarm approach may transfer to embodied settings such as robotics where vision, planning, and action must be coordinated under time constraints.
- Releasing the checkpoint could speed up community development of open multimodal agents that outperform closed single-model systems.
- Future scaling experiments could test whether the latency advantage holds when the underlying base model is made substantially larger.
Load-bearing premise
The claimed state-of-the-art scores and latency reductions are produced by the joint text-vision techniques and Agent Swarm rather than by model scale, data choices, or evaluation details not described in the paper.
What would settle it
A side-by-side evaluation of Kimi K2.5 with and without the Agent Swarm component, or without the joint text-vision training steps, showing whether the ablated variants still reach the reported state-of-the-art scores and the 4.5× latency reduction.
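Mechanically, that settling experiment is a small grid: each ablated variant crossed with each benchmark family, holding everything else fixed. A hypothetical harness sketch; CONFIGS, BENCHMARKS, and evaluate are placeholders we invented, since the paper describes no evaluation code:

```python
from itertools import product

# Hypothetical ablation grid; flag names are placeholders, not the paper's.
CONFIGS = {
    "full":        {"joint_text_vision": True,  "agent_swarm": True},
    "no_swarm":    {"joint_text_vision": True,  "agent_swarm": False},
    "no_joint_tv": {"joint_text_vision": False, "agent_swarm": True},
}
BENCHMARKS = ["coding", "vision", "reasoning", "agentic"]  # stand-in names

def evaluate(config: dict, benchmark: str) -> dict:
    """Placeholder: would load the matching checkpoint variant, run the
    benchmark suite, and return {'score': float, 'latency_s': float}."""
    raise NotImplementedError

def run_ablation() -> None:
    # The attribution survives only if 'full' beats both ablations on score
    # and the full-vs-no_swarm latency ratio approaches the reported 4.5x.
    for (name, config), bench in product(CONFIGS.items(), BENCHMARKS):
        result = evaluate(config, bench)
        print(f"{name:>12} | {bench:>9} | "
              f"score={result['score']:.1f} | latency={result['latency_s']:.1f}s")
```

Controls that hold model size and training data fixed across rows are what separate this design from the scale-and-data confound named in the load-bearing premise.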
Original abstract
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to 4.5× over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kimi K2.5, an open-source multimodal agentic model that jointly optimizes text and vision modalities via pre-training, zero-vision SFT, and joint RL. It further proposes Agent Swarm, a self-directed parallel orchestration framework for decomposing and executing heterogeneous sub-tasks concurrently. The manuscript claims state-of-the-art results across coding, vision, reasoning, and agentic tasks, plus up to 4.5× latency reduction relative to single-agent baselines, and releases the post-trained model checkpoint.
Significance. If the empirical claims hold, the work would advance multimodal agentic systems by demonstrating synergistic text-vision training and a practical parallel-agent framework, with the open checkpoint release providing a concrete resource for reproducibility and follow-on research. The emphasis on joint optimization and dynamic task decomposition addresses real deployment constraints in agentic intelligence.
major comments (2)
- [Abstract] The manuscript asserts SOTA performance across multiple domains and a 4.5× latency reduction from Agent Swarm, yet supplies no benchmark scores, named baselines, dataset specifications, error bars, or quantitative tables anywhere in the text. This absence prevents verification of the central attribution that the listed techniques (joint pre-training, zero-vision SFT, joint RL, Agent Swarm) causally produce the gains rather than unreported differences in scale or data.
- [Evaluation and Methodology sections] No ablation studies, controls for model size, or comparisons isolating each component are provided. Without these, the causal link between the described joint text-vision methods / Agent Swarm and the headline outcomes cannot be established, rendering the performance claims unverifiable from the manuscript.
minor comments (2)
- The description of Agent Swarm would benefit from pseudocode or a formal algorithmic outline to clarify its self-directed decomposition and concurrency mechanics.
- The open release of the post-trained checkpoint is a clear strength that supports community follow-up.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
-
Referee: [Abstract] The manuscript asserts SOTA performance across multiple domains and a 4.5× latency reduction from Agent Swarm, yet supplies no benchmark scores, named baselines, dataset specifications, error bars, or quantitative tables anywhere in the text. This absence prevents verification of the central attribution that the listed techniques (joint pre-training, zero-vision SFT, joint RL, Agent Swarm) causally produce the gains rather than unreported differences in scale or data.
Authors: We agree that the current abstract does not contain specific quantitative results. In the revised manuscript we will expand the abstract to include key benchmark scores, named baselines, dataset references, and the reported latency reduction, with explicit pointers to the evaluation tables. We will also ensure the main text contains complete tables with scores, error bars, and dataset specifications so that the performance claims can be directly verified. Revision: yes.
-
Referee: [Evaluation and Methodology sections] No ablation studies, controls for model size, or comparisons isolating each component are provided. Without these, the causal link between the described joint text-vision methods / Agent Swarm and the headline outcomes cannot be established, rendering the performance claims unverifiable from the manuscript.
Authors: We acknowledge that the submitted manuscript lacks ablation studies and component-isolation experiments. In the revision we will add a dedicated ablation section that reports results for each training stage (joint pre-training, zero-vision SFT, joint RL) and for Agent Swarm versus single-agent baselines, including controls that hold model size and data scale constant. These additions will directly address the causal attribution of the observed gains. Revision: yes.
Circularity Check
No circularity: empirical performance claims with no derivations or self-referential reductions
Full rationale
The paper presents Kimi K2.5 as an empirical multimodal model whose claims rest on reported benchmark performance (SOTA across coding/vision/reasoning/agentic tasks) and a latency reduction (up to 4.5× via Agent Swarm). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. Techniques such as joint text-vision pre-training and Agent Swarm are introduced as design choices whose effects are asserted via external evaluation rather than constructed internally. The central attribution to these techniques is therefore not circular by construction; it is simply an empirical claim whose strength depends on unreported ablations and controls, not on definitional equivalence to the inputs.
Axiom & Free-Parameter Ledger
invented entities (1)
- Agent Swarm: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (relevance: unclear), matched to the claim: "Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to 4.5× over single-agent baselines."
Forward citations
Cited by 60 Pith papers
-
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench is the first benchmark for multimodal models on naturally occurring table images, with only one of 21 tested models exceeding 50% accuracy and most ranging from 4.1% to 49.9%.
-
Can Coding Agents Reproduce Findings in Computational Materials Science?
AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four tasks showing MLLM capability gaps that improve via supervised fine-tuning.
-
Learning Agentic Policy from Action Guidance
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
-
Chronicles-OCR: A Cross-Temporal Perception Benchmark for the Evolutionary Trajectory of Chinese Characters
Chronicles-OCR is the first benchmark with 2,800 images across the complete evolutionary trajectory of Chinese characters, defining four tasks to evaluate VLLMs' cross-temporal visual perception.
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that all tested LLMs emit harmful stereotypes in open-ended stories, with associations adapting to prompt language and targeting locally salient groups rather than transferring uniformly across languages.
-
StereoTales: A Multilingual Framework for Open-Ended Stereotype Discovery in LLMs
StereoTales shows that LLMs produce harmful, culturally adapted stereotypes in open-ended multilingual stories, with patterns consistent across providers and aligned human-LLM harm judgments.
-
Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
-
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.
-
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
OPHSD uses harness-augmented models as teachers to distill reasoning capabilities into base LLMs, yielding strong standalone performance on classification and math tasks.
-
VibeProteinBench: An Evaluation Benchmark for Language-interfaced Vibe Protein Design
VibeProteinBench is a three-stage language-interfaced benchmark revealing that no current LLM performs strongly across recognition, engineering, and generation of proteins.
-
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes uses a dual-grained RL framework with parallel tool actions and efficiency rewards to achieve 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source multimodal agents.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
-
On Time, Within Budget: Constraint-Driven Online Resource Allocation for Agentic Workflows
MCPP is a Monte Carlo simulation-based online planner that improves the probability of agentic workflows completing successfully under explicit budget and deadline constraints compared to baselines on CodeFlow and Pro...
-
AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries
AffectSeek is an agentic framework that localizes affective moments, classifies emotions, and generates rationales in long videos under vague user queries, backed by the new VQAU-Bench benchmark.
-
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
-
MolViBench: Evaluating LLMs on Molecular Vibe Coding
MolViBench is the first benchmark designed to evaluate LLMs on generating executable programs for molecular tasks in drug discovery.
-
Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
-
Benchmarking and Improving GUI Agents in High-Dynamic Environments
DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
-
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
-
MathDuels: Evaluating LLMs as Problem Posers and Solvers
Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or t...
-
S-GRPO: Unified Post-Training for Large Vision-Language Models
S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.
-
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
-
Many-Tier Instruction Hierarchy in LLM Agents
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
-
ClawBench: Can AI Agents Complete Everyday Online Tasks?
ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
-
Learning to Interrupt in Language-based Multi-agent Communication
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
-
DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions
DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.
-
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
-
MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems
MMORF provides a modular multi-agent framework for multi-objective retrosynthesis planning, with MASIL and RFAS systems showing strong safety, cost, and success metrics on a new 218-task benchmark.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
-
HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
HumorRank ranks nine LLMs on textual humor using GTVH-grounded pairwise tournaments and Adaptive Swiss aggregation on the SemEval-2026 MWAHAHA dataset, finding that comedic mechanism mastery matters more than scale.
-
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
-
MMSkills: Towards Multimodal Skills for General Visual Agents
MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Missing old logits in async agentic RL entangle discrepancy and staleness terms in PPO off-policy correction; exact acquisition methods and revised PPO-EWMA restore decoupled updates with reported gains in speed and p...
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and ...
-
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training lo...
-
From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
LogiHard hardens reasoning benchmarks by transforming 0-order selection into 2-order judgment, causing 31-56% accuracy drops in 12 frontier LLMs and a 47% drop on zero-shot MMLU, revealing a combinatorial reasoning ga...
-
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Reference graph
Works this paper leans on
- [1] Moonshot AI. Introducing Kimi K2 Thinking. 2025. URL: https://moonshotai.github.io/Kimi-K2/thinking.html
- [2] Moonshot AI. Kimi-Researcher: End-to-End RL Training for Emerging Agentic Capabilities. 2025. URL: https://moonshotai.github.io/Kimi-Researcher/
- [3] Amazon Web Services. Amazon Simple Storage Service (Amazon S3). 2023. URL: https://aws.amazon.com/s3/ (visited on 12/15/2023)
- [4] Mathematical Association of America. 2025 American Invitational Mathematics Examination I. Held on February 6, 2025. URL: https://artofproblemsolving.com/wiki/index.php/2025_AIME_I
- [5] Anthropic. Building multi-agent systems: when and how to use them. 2026. URL: https://claude.com/blog/building-multi-agent-systems-when-and-how-to-use-them
- [6] Anthropic. Claude Opus 4.5 System Card. 2025. URL: https://www-cdn.anthropic.com/bf10f64990cfda0ba858290be7b8cc6317685f47.pdf
- [7] Anthropic. How we built our multi-agent research system. 2025. URL: https://www.anthropic.com/engineering/multi-agent-research-system
- [8] Shuai Bai et al. Qwen3-VL Technical Report. 2025. arXiv:2511.21631 [cs.CV]. URL: https://arxiv.org/abs/2511.21631
- [9]
- [10] Greg Brockman et al. OpenAI Gym. 2016. arXiv:1606.01540 [cs.LG]. URL: https://arxiv.org/abs/1606.01540
- [11] Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. arXiv:2005.14165 [cs.CL]. URL: https://arxiv.org/abs/2005.14165
- [12]
- [13] Xianfu Cheng et al. SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
- [14]
- [15] DeepSeek-AI et al. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models. 2025. arXiv:2512.02556 [cs.CL]. URL: https://arxiv.org/abs/2512.02556
- [16]
- [17] Xiang Deng et al. "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" In: arXiv preprint arXiv:2509.16941 (2025)
- [18] Chaoyou Fu et al. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. 2025. arXiv:2405.21075 [cs.CV]. URL: https://arxiv.org/abs/2405.21075
- [19]
- [20] Samir Yitzhak Gadre et al. "DataComp: In search of the next generation of multimodal datasets". In: Advances in Neural Information Processing Systems 36 (2024)
- [21] Google. Gemini 3 Pro. 2025. URL: https://deepmind.google/models/gemini/pro/
- [22] Dong Guo et al. Seed1.5-VL Technical Report. 2025. arXiv:2505.07062 [cs.CV]. URL: https://arxiv.org/abs/2505.07062
- [23] Lukas Haas et al. SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
- [24]
- [25]
- [26] Wenyi Hong et al. MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models. 2025. arXiv:2501.02955 [cs.CV]. URL: https://arxiv.org/abs/2501.02955
- [27] Kairui Hu et al. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos. arXiv:2501.13826 [cs.CV]. URL: https://arxiv.org/abs/2501.13826
- [28]
- [29]
- [30] Yanping Huang et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. 2019. arXiv:1811.06965 [cs.CV]. URL: https://arxiv.org/abs/1811.06965
- [31] Naman Jain et al. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code". In: arXiv preprint arXiv:2403.07974 (2024)
- [32] Carlos E. Jimenez et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" In: arXiv preprint arXiv:2310.06770 (2023)
- [33] Keller Jordan et al. Muon: An optimizer for hidden layers in neural networks. 2024. URL: https://kellerjordan.github.io/posts/muon/
- [34] Kimi Team. "Kimi k1.5: Scaling Reinforcement Learning with LLMs". In: arXiv preprint arXiv:2501.12599 (2025)
- [35] Hugo Laurençon et al. "OBELICS: An open web-scale filtered dataset of interleaved image-text documents". In: Advances in Neural Information Processing Systems 36 (2024)
- [36] Dmitry Lepikhin et al. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". In: arXiv preprint arXiv:2006.16668 (2020)
- [37] Jingyuan Liu et al. "Muon is Scalable for LLM Training". In: arXiv preprint arXiv:2502.16982 (2025)
- [38] Yuliang Liu et al. "OCRBench: on the hidden mystery of OCR in large multimodal models". In: Science China Information Sciences 67.12 (Dec. 2024). ISSN: 1869-1919. DOI: 10.1007/s11432-024-4235-6. URL: http://dx.doi.org/10.1007/s11432-024-4235-6
- [39] Pan Lu et al. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. 2024. arXiv:2310.02255 [cs.CV]. URL: https://arxiv.org/abs/2310.02255
- [40] Thang Luong et al. "Towards Robust Mathematical Reasoning". In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 35418–35442. ISBN: 979-8-89176-332-6. DOI: 10.18653/v1/2025.emnlp-main.1794
- [41]
- [42] Mike A. Merrill et al. "Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces". In: arXiv preprint arXiv:2601.11868 (2026)
- [43]
- [44] OpenAI. Introducing GPT 5.2. 2025. URL: https://openai.com/index/introducing-gpt-5-2/
- [45]
- [46]
- [47] Bowen Peng et al. "YaRN: Efficient Context Window Extension of Large Language Models". In: arXiv preprint arXiv:2309.00071 (2023)
- [48] Thinh Pham et al. SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models. Seal-0 is the main subset of this benchmark. 2025. arXiv:2506.01062 [cs.CL]. URL: https://arxiv.org/abs/2506.01062
- [49] Long Phan et al. Humanity's Last Exam. 2025. arXiv:2501.14249 [cs.LG]. URL: https://arxiv.org/abs/2501.14249
- [50] David Rein et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark". In: First Conference on Language Modeling. 2024
- [51]
- [52] Christoph Schuhmann et al. "LAION-5B: An open large-scale dataset for training next generation image-text models". In: Advances in Neural Information Processing Systems 35 (2022), pp. 25278–25294
- [53] John Schulman et al. "Proximal Policy Optimization Algorithms". In: arXiv preprint arXiv:1707.06347 (2017). URL: https://arxiv.org/abs/1707.06347
- [54]
- [55] Giulio Starace et al. "PaperBench: Evaluating AI's Ability to Replicate AI Research". In: arXiv preprint arXiv:2504.01848 (2025)
- [56] Kimi Team et al. "Kimi K2: Open Agentic Intelligence". In: arXiv preprint arXiv:2507.20534 (2025)
- [57] Kimi Team et al. "Kimi-VL Technical Report". In: arXiv preprint arXiv:2504.07491 (2025)
- [58] Meituan LongCat Team et al. "LongCat-Flash-Omni Technical Report". In: arXiv preprint arXiv:2511.00279 (2025)
- [59] Minyang Tian et al. "SciCode: A research coding benchmark curated by scientists". In: Advances in Neural Information Processing Systems 37 (2024), pp. 30624–30650
- [60]
- [61] Harvard-MIT Mathematics Tournament. Harvard-MIT Mathematics Tournament, February 2025. Held on February 15, 2025. URL: https://www.hmmt.org/www/archive/282
- [62] Ashish Vaswani et al. "Attention is All you Need". In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
- [63] Nikhita Vedula et al. DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents. 2025. URL: https://storage.googleapis.com/deepmind-media/DeepSearchQA/DeepSearchQA_benchmark_paper.pdf
- [64]
- [65]
- [66]
- [67] Yubo Wang et al. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. 2024. arXiv:2406.01574 [cs.CL]. URL: https://arxiv.org/abs/2406.01574
- [68] Zhexu Wang et al. "OJBench: A Competition Level Code Benchmark For Large Language Models". In: arXiv preprint arXiv:2506.16395 (2025)
- [69] Zhun Wang et al. "CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale". In: arXiv preprint arXiv:2506.02548 (2025)
- [70]
- [71] Jason Wei et al. BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. 2025. arXiv:2504.12516 [cs.CL]. URL: https://arxiv.org/abs/2504.12516
- [72]
- [73]
- [74]
- [75] Tianbao Xie et al. "Introducing OSWorld-Verified". In: xlang.ai (July 2025). URL: https://xlang.ai/blog/osworld-verified
- [76] Tianbao Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. 2024. arXiv:2404.07972 [cs.AI]
- [77] Feng Yao et al. Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. Aug. 2025. URL: https://fengyao.notion.site/off-policy-rl
- [78] Xiang Yue et al. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. 2025. arXiv:2409.02813 [cs.CL]. URL: https://arxiv.org/abs/2409.02813
- [79] Xiang Yue et al. "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI". In: Proceedings of CVPR. 2024
- [80]