Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent


arXiv: 2606.30616 · v1 · pith:GHUP7TQ7new · submitted 2026-06-29 · cs.CL


Lei Bai, Zongsheng Cao, Yang Chen, Zhiyao Cui, Shangheng Du, Yue Fan, Shiyang Feng, Zijie Guo, Haonan He, Liang He, Xiaohan He, Shuyue Hu, Yusong Hu, Songtao Huang, Yichen Jiang, Hao Li, Xin Li, Dahua Lin, Weihao Lin, Fenghua Ling, Dongrui Liu, Zhuo Liu, Runmin Ma, Chunjiang Mu, Haoyang Peng, Tianshuo Peng, Jinxin Shi, Luohe Shi, Boyuan Sun, Zelin Tan, Shengji Tang, Qianyi Wang, Yiming Wu, Yi Xie, Xiangchao Yan, Jingqi Ye, Peng Ye, Fangchen Yu, Jiakang Yuan, Bihao Zhan, Bo Zhang, Chen Zhang, Shufei Zhang, Shuaiyu Zhang, Wenlong Zhang, Yiqun Zhang, Junpeng Zhao, Zhijie Zhong, Bowen Zhou, Yuhao Zhou


Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3


keywords: agent horizon scaling · 35B MoE model · long-horizon trajectories · multi-teacher distillation · agent benchmarks · knowledge-action infrastructure · domain-routed training


The pith

A 35B agent reaches trillion-parameter performance on long-horizon tasks by scaling trajectories instead of model size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a 35 billion parameter Mixture-of-Experts model can match or exceed 1T-parameter models on agent benchmarks by extending the length and diversity of its reasoning paths. It constructs a knowledge-action infrastructure that generates trajectories averaging 45K tokens linking external knowledge, actions, observations, and verifiers. A three-stage training process first aligns the base model through full-domain supervised fine-tuning, then creates specialized domain-level teachers, and finally applies multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to unify six domains in one student model. If this holds, it offers a route to strong agent performance without the compute cost of trillion-scale parameter counts.

Core claim

Agents-A1, a 35B Mixture-of-Experts model trained via a three-stage recipe on long-horizon trajectories averaging 45K tokens, achieves leading scores on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8) while remaining competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5) against 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.

What carries the argument

The long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes to produce representative agentic trajectories.
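To make the four linked components concrete, here is a minimal sketch of what one trajectory record might look like. The paper's actual schema is not given in the text reviewed here, so every class and field name below is hypothetical, and a whitespace split stands in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step. All field names are hypothetical, not the paper's schema."""
    knowledge: str                          # external knowledge retrieved for this step
    action: str                             # tool call or command issued by the agent
    observation: str                        # what the environment returned
    verifier_passed: Optional[bool] = None  # verifier outcome, None if no verifier ran


@dataclass
class Trajectory:
    domain: str
    steps: List[Step] = field(default_factory=list)

    def token_count(self) -> int:
        # Assumption: whitespace split as a stand-in for a real tokenizer.
        return sum(
            len(text.split())
            for s in self.steps
            for text in (s.knowledge, s.action, s.observation)
        )

    def verified(self) -> bool:
        # True only if at least one verifier ran and none of them failed.
        outcomes = [s.verifier_passed for s in self.steps if s.verifier_passed is not None]
        return bool(outcomes) and all(outcomes)


def mean_length(trajectories: List[Trajectory]) -> float:
    """Average token count over a set of trajectories."""
    if not trajectories:
        return 0.0
    return sum(t.token_count() for t in trajectories) / len(trajectories)
```

Under this reading, the reported 45K figure would correspond to `mean_length` over the training set, and `verified` is one plausible filter for deciding which trajectories are kept.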

Load-bearing premise

The generated long-horizon trajectories are assumed to represent real deployment conditions and to transfer across the six domains without overfitting or benchmark leakage.

What would settle it

A new long-horizon agent benchmark constructed after the training data cutoff, with no trajectory overlap, would show whether Agents-A1 maintains its reported advantage over the 1T models.

Original abstract

We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose a multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A 35B model claims 1T-level agent performance via 45K-token trajectories and multi-teacher distillation, but missing leakage checks and ablations make the numbers hard to trust.


The main thing to know is that Agents-A1, a 35B MoE, reports matching or beating 1T models like Kimi-K2.6 on several long-horizon benchmarks (SEAL-0 at 56.4, IFBench at 80.6, HiPhO at 46.4) by scaling trajectories to an average of 45K tokens and running a three-stage recipe of full-domain SFT, domain-specific teachers, and multi-teacher on-policy distillation with vocabulary alignment.

What is actually new is the concrete infrastructure that wires external knowledge, actions, observations, and verifiers into long trajectories, plus the domain-routed distillation step that collapses six heterogeneous domains into one student model. The paper does a reasonable job spelling out a practical, deployable training path instead of just scaling parameters.

The soft spots are the usual ones for this kind of claim. The text gives benchmark numbers but no evaluation protocol details, no contamination audit on the 45K-token data, no ablations that isolate the horizon component, and no error bars or significance tests. The stress-test concern about possible leakage or domain overfitting therefore stands, because nothing rules it out. Without those controls the central story (that horizon scaling, not data overlap, explains the results) cannot be assessed.

This is for people working on efficient agent training who want a worked example of multi-domain distillation. A reader who needs reliable performance numbers should treat the current version as preliminary. It deserves peer review because the claim matters if it holds and the recipe is specific enough for referees to request the missing checks.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agents-A1, a 35B Mixture-of-Experts agentic model that reaches performance levels comparable to 1T-parameter models on long-horizon tasks by scaling agent horizons rather than parameters. It describes a knowledge-action infrastructure generating trajectories averaging 45K tokens, followed by a three-stage training process (full-domain SFT, domain-specific teachers, and multi-teacher on-policy distillation with vocabulary alignment) that unifies six heterogeneous domains into a single deployable model. The central empirical claim is that Agents-A1 leads or matches 1T models on benchmarks including SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), while remaining competitive on SciCode, HLE, and BrowseComp.

Significance. If the performance claims hold after proper verification, the work would provide evidence that horizon scaling via long trajectories and multi-domain distillation can be more efficient than parameter scaling for agentic capabilities, offering a practical route to high-performance agents on smaller models. The explicit infrastructure for 45K-token trajectories and the three-stage recipe would constitute reusable contributions if accompanied by sufficient controls and ablations.

major comments (3)

[Abstract] The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.
[Training recipe and infrastructure] No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.
[Results and comparison] The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.
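For readers unfamiliar with what a contamination audit involves, the sketch below shows one common baseline: flagging benchmark items whose word n-grams reappear verbatim in trajectory text. This is an illustration only, not a procedure the paper describes; the function names, the n-gram size, and the threshold are all assumptions.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, trajectory_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the trajectory."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(trajectory_text, n)) / len(item_grams)


def flag_contaminated(items: List[str], trajectories: List[str],
                      n: int = 8, threshold: float = 0.5) -> List[int]:
    """Indices of benchmark items whose best overlap with any trajectory meets the threshold."""
    flagged = []
    for idx, item in enumerate(items):
        best = max((overlap_ratio(item, t, n) for t in trajectories), default=0.0)
        if best >= threshold:
            flagged.append(idx)
    return flagged
```

A verbatim check like this catches copied items but not paraphrases, which the second major comment also raises; detecting those would need a semantic-similarity step on top.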

minor comments (2)

[Abstract] The abstract and introduction use the term 'leading results' without defining the precise ranking criteria or listing all competing models evaluated.
[Method] Notation for the multi-teacher distillation step (vocabulary alignment) is introduced at a high level; a concrete equation or pseudocode would improve reproducibility.
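As an indication of the kind of pseudocode the second minor comment asks for, here is a toy sketch of a domain-routed, per-token distillation loss. It is not the paper's method. It assumes that on-policy distillation minimizes a reverse KL between student and teacher on student-sampled positions, that routing means picking one teacher by domain label, and that "salient vocabulary" can be approximated by the teacher's top-k tokens. All names are hypothetical and distributions are plain dictionaries.

```python
import math
from typing import Dict, List

Dist = Dict[str, float]  # token -> probability at one position


def salient_tokens(teacher: Dist, k: int) -> List[str]:
    """Top-k tokens by teacher probability (an assumed notion of 'salient')."""
    return sorted(teacher, key=teacher.get, reverse=True)[:k]


def renormalize(dist: Dist, tokens: List[str], eps: float = 1e-12) -> Dist:
    """Restrict a distribution to the given tokens and rescale it to sum to 1."""
    mass = {t: dist.get(t, 0.0) + eps for t in tokens}
    total = sum(mass.values())
    return {t: p / total for t, p in mass.items()}


def reverse_kl(student: Dist, teacher: Dist, k: int = 3) -> float:
    """KL(student || teacher), computed on the teacher's top-k tokens only."""
    tokens = salient_tokens(teacher, k)
    s, t = renormalize(student, tokens), renormalize(teacher, tokens)
    return sum(s[tok] * math.log(s[tok] / t[tok]) for tok in tokens)


def routed_loss(domain: str, student_steps: List[Dist],
                teachers: Dict[str, List[Dist]], k: int = 3) -> float:
    """Mean per-position reverse KL against the teacher chosen by the domain label.

    teachers[domain] holds that teacher's next-token distributions at the same
    positions of a student-sampled sequence (assumed to be precomputed).
    """
    teacher_steps = teachers[domain]  # domain routing: one teacher per domain
    losses = [reverse_kl(s, t, k) for s, t in zip(student_steps, teacher_steps)]
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero when the student already matches the routed teacher on the salient tokens and grows as the two diverge. A real implementation would operate on logits over the full tokenizer vocabulary and would need to handle teachers and students with different vocabularies.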

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency on evaluation protocols, ablations, and contamination controls. These points are important for strengthening the central claims. We respond to each major comment below and will revise the manuscript accordingly.


Referee: [Abstract] The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.

Authors: We agree that the abstract lacks these details. In the revision we will expand the evaluation section to describe the protocols used for each benchmark, steps taken to mitigate data leakage from the trajectory data, and any available statistical information. Error bars were not computed owing to the prohibitive cost of repeated full evaluations; we will explicitly note this limitation and discuss observed variance across domains where feasible. Revision: yes.

Referee: [Training recipe and infrastructure] No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.

Authors: We acknowledge the absence of a dedicated ablation separating trajectory length from the three-stage recipe. We will add such an ablation comparing shorter- versus full-length trajectories. We will also document the data-generation pipeline and any decontamination procedures applied to ensure the 45K-token trajectories do not contain or paraphrase benchmark items. Revision: yes.

Referee: [Results and comparison] The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.

Authors: We will add a dedicated subsection confirming that all evaluation sets were held out from the knowledge-action infrastructure and describing the contamination audit performed. These details will be placed in the results section to support the reported comparisons. Revision: yes.
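On the error-bar point, a resampling estimate needs no additional evaluation runs. The sketch below computes a percentile bootstrap confidence interval from per-item pass/fail results of a single run. The per-item results are hypothetical, since the paper does not publish them, and the function name and defaults are assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(per_item_correct: List[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) in percent from a percentile bootstrap.

    per_item_correct holds 1 for a solved benchmark item and 0 otherwise.
    """
    rng = random.Random(seed)
    n = len(per_item_correct)
    point = 100.0 * sum(per_item_correct) / n
    means = []
    for _ in range(n_resamples):
        # Resample benchmark items with replacement and rescore.
        sample = rng.choices(per_item_correct, k=n)
        means.append(100.0 * sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Hypothetical single-run results: 56 of 100 items solved.
accuracy, low, high = bootstrap_ci([1] * 56 + [0] * 44)
```

This interval reflects only the uncertainty from which items happen to be in the benchmark. It does not capture run-to-run variance from sampling or tool nondeterminism, which would still require repeated evaluations.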

Circularity Check

0 steps flagged

No circularity: the empirical benchmark claims rest on the described training stages, with no self-referential reductions or load-bearing self-citations.


The provided manuscript text (abstract plus context) describes a three-stage training recipe (full-domain SFT, domain teachers, multi-teacher distillation) and reports benchmark scores as outcomes of the long-horizon infrastructure. No equations, fitted parameters renamed as predictions, or self-citation chains appear that would make any result equivalent to its inputs by construction. The central claim is an empirical comparison to 1T models. Since no derivation collapses to the input data or to the authors' prior work, the results are measured against external benchmarks rather than derived from the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no information on free parameters, axioms or invented entities; the ledger is therefore empty.



