Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent
Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3
The pith
A 35B agent reaches trillion-parameter performance on long-horizon tasks by scaling trajectories instead of model size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agents-A1, a 35B Mixture-of-Experts model trained via a three-stage recipe on long-horizon trajectories averaging 45K tokens, achieves leading scores on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8) while remaining competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5) against 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.
What carries the argument
The long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes to produce representative agentic trajectories.
Load-bearing premise
The generated long-horizon trajectories are assumed to represent real deployment conditions and to transfer across the six domains without overfitting or benchmark leakage.
What would settle it
A new long-horizon agent benchmark constructed after the training data cutoff, with no trajectory overlap, would show whether Agents-A1 maintains its reported advantage over the 1T models.
Original abstract
We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance on long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6), and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agents-A1, a 35B Mixture-of-Experts agentic model that reaches performance levels comparable to 1T-parameter models on long-horizon tasks by scaling agent horizons rather than parameters. It describes a knowledge-action infrastructure generating trajectories averaging 45K tokens, followed by a three-stage training process (full-domain SFT, domain-specific teachers, and multi-teacher on-policy distillation with vocabulary alignment) that unifies six heterogeneous domains into a single deployable model. The central empirical claim is that Agents-A1 leads or matches 1T models on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), while remaining competitive on SciCode, HLE, and BrowseComp.
Significance. If the performance claims hold after proper verification, the work would provide evidence that horizon scaling via long trajectories and multi-domain distillation can be more efficient than parameter scaling for agentic capabilities, offering a practical route to high-performance agents on smaller models. The explicit infrastructure for 45K-token trajectories and the three-stage recipe would constitute reusable contributions if accompanied by sufficient controls and ablations.
major comments (3)
- [Abstract] Abstract: The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.
- [Training recipe and infrastructure] Training and infrastructure description: No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the six evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.
- [Results and comparison] Benchmark comparison: The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.
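One simple form the requested contamination audit could take is an n-gram overlap check between benchmark items and training trajectories. The sketch below is illustrative only: the function names, the whitespace tokenization, and the 8-gram window are assumptions, not details from the manuscript, and an exact-overlap check of this kind would not detect paraphrased items.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(
    benchmark_items: Sequence[str],
    trajectories: Sequence[str],
    n: int = 8,
) -> List[int]:
    """Indices of benchmark items sharing at least one n-gram with any trajectory."""
    seen: Set[Tuple[str, ...]] = set()
    for traj in trajectories:
        seen |= ngrams(traj, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & seen]
```

Items shorter than `n` tokens produce no n-grams and are never flagged, so a real audit would need a smaller window or a separate rule for short questions.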
minor comments (2)
- [Abstract] The abstract and introduction use the term 'leading results' without defining the precise ranking criteria or listing all competing models evaluated.
- [Method] Notation for the multi-teacher distillation step (vocabulary alignment) is introduced at a high level; a concrete equation or pseudocode would improve reproducibility.
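To make the pseudocode request concrete, here is a minimal sketch of what a domain-routed on-policy distillation loss might look like. It is not the authors' objective, which the text available here does not specify. It assumes a per-token reverse KL on student-sampled steps (as in generic on-policy distillation), routes each trajectory to one teacher by its domain label, and uses the teacher's top-k tokens as a stand-in for "salient vocabulary alignment". All names (`salient_ids`, `reverse_kl`, `routed_distill_loss`) and the top-k rule are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def salient_ids(teacher_logits: Sequence[float], k: int) -> List[int]:
    """ASSUMPTION: 'salient' tokens are the teacher's k highest-scoring tokens."""
    return sorted(range(len(teacher_logits)), key=lambda i: -teacher_logits[i])[:k]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float], k: int) -> float:
    """KL(student || teacher), both renormalized over the salient subset."""
    ids = salient_ids(teacher_logits, k)
    log_s = log_softmax([student_logits[i] for i in ids])
    log_t = log_softmax([teacher_logits[i] for i in ids])
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


def routed_distill_loss(
    domain: str,
    student_steps: Sequence[Sequence[float]],
    teachers: Dict[str, Callable[[int], Sequence[float]]],
    k: int = 2,
) -> float:
    """Mean per-token loss against the teacher selected by `domain`."""
    teacher = teachers[domain]  # domain routing: one teacher per domain
    losses = [reverse_kl(s, teacher(t), k) for t, s in enumerate(student_steps)]
    return sum(losses) / len(losses)
```

In this toy form each teacher is a callable from step index to logits. In practice the teacher would score the student's own sampled prefix, and the loss is zero only when student and routed teacher agree on the salient subset.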
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater transparency on evaluation protocols, ablations, and contamination controls. These points are important for strengthening the central claims. We respond to each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] Abstract: The reported benchmark scores (e.g., SEAL-0 56.4, IFBench 80.6) are presented without any description of evaluation protocols, controls for data leakage from the 45K-token trajectories, error bars, or statistical significance testing. This information is load-bearing for the central claim that Agents-A1 matches or exceeds 1T models such as Kimi-K2.6 and DeepSeek-V4-pro.
  Authors: We agree that the abstract lacks these details. In the revision we will expand the evaluation section to describe the protocols used for each benchmark, steps taken to mitigate data leakage from the trajectory data, and any available statistical information. Error bars were not computed owing to the prohibitive cost of repeated full evaluations; we will explicitly note this limitation and discuss observed variance across domains where feasible.
  Revision: yes
- Referee: [Training recipe and infrastructure] Training and infrastructure description: No ablation is reported that isolates the contribution of the long-horizon (45K-token) trajectories from the three-stage recipe or that verifies the trajectories were generated without including or paraphrasing items from the six evaluation benchmarks. Without such controls, the cross-domain generalization claim rests on an unverified assumption.
  Authors: We acknowledge the absence of a dedicated ablation separating trajectory length from the three-stage recipe. We will add such an ablation comparing shorter- versus full-length trajectories. We will also document the data-generation pipeline and any decontamination procedures applied to ensure the 45K-token trajectories do not contain or paraphrase benchmark items.
  Revision: yes
- Referee: [Results and comparison] Benchmark comparison: The headline results against 1T models are stated as leading on five benchmarks, yet the manuscript supplies no details on whether the evaluation sets were held out from the knowledge-action infrastructure data or on any contamination audit. This directly affects the validity of the horizon-scaling thesis.
  Authors: We will add a dedicated subsection confirming that all evaluation sets were held out from the knowledge-action infrastructure and describing the contamination audit performed. These details will be placed in the results section to support the reported comparisons.
  Revision: yes
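On the error-bar point, repeated full evaluations are not the only route to an uncertainty estimate. If per-item correctness from a single run is logged, a percentile bootstrap over items gives an interval for each benchmark score at negligible cost. The sketch below assumes such per-item scores exist; the function name, resample count, and seed are illustrative choices, and the interval reflects item-sampling variability only, not run-to-run variance from stochastic decoding.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-item scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For example, `bootstrap_ci([1] * 56 + [0] * 44)` would bracket an observed accuracy of 0.56 on a hypothetical 100-item benchmark. When two models are compared on the same items, resampling the paired per-item differences is the more appropriate variant.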
Circularity Check
No circularity: empirical benchmark claims rest on described training stages without self-referential reductions or load-bearing self-citations
Full rationale
The provided manuscript text (abstract plus context) describes a three-stage training recipe (full-domain SFT, domain teachers, multi-teacher distillation) and reports benchmark scores as outcomes of the long-horizon infrastructure. It contains no equations, no fitted parameters renamed as predictions, and no self-citation chains that would make any result equivalent to its inputs by construction. The central claim is an empirical comparison with 1T models, so it depends on external benchmarks rather than on a derivation that collapses to the input data or to the authors' prior work.
A. Appendix
A.1. Contributions and Acknowledgments
Knowledge-Action Infrastructure: Zongsheng Cao†, Bihao Zhan, Zhijie Zhong Full-dom...