Sakana Fugu Technical Report

arXiv: 2606.21228 · v2 · pith:DOBHCEEJ · submitted 2026-06-19 · 💻 cs.LG

Yujin Tang, Edoardo Cetin, Jinglue Xu, Qi Sun, Stefan Nielsen, Vincent Richard, Haruto Goda, Iaroslav Tymchenko, Nhan Nguyen, Hyunin Lee, Mari Ashiga, Shashank Kotyan, So Kuroki, Tarin Clanuwat

Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords orchestrator models · agentic scaffolds · multi-agent systems · LLM agents · collective intelligence · fine-tuning · evolutionary algorithms · reinforcement learning

The pith

Fugu orchestrator models dynamically create scaffolds to coordinate LLM agent teams and exceed any single model's performance on hard tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Sakana Fugu as a family of language models trained to serve as orchestrators for teams of LLM agents. These orchestrators analyze a query and generate custom agentic scaffolds that direct how the agents work together. The central idea is that this coordination produces results stronger than what any one LLM can deliver on its own. The training combines large-scale fine-tuning with evolutionary algorithms and reinforcement learning to build the orchestrators. Two versions are released, one optimized for speed and one for maximum quality on difficult problems.

Core claim

Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning.

What carries the argument

Adaptive agentic scaffolds dynamically generated by the orchestrator models to harness and combine capabilities across an LLM agent team.
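The report does not specify what a scaffold looks like, so the following is only a toy sketch of the general idea, assuming a scaffold can be represented as an ordered list of agent calls. The names `Step`, `toy_orchestrator`, `run_scaffold`, and the stub agents are all hypothetical, and the keyword rule stands in for what would be a trained model in Fugu.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent type: takes a prompt string, returns a response string.
Agent = Callable[[str], str]


@dataclass
class Step:
    agent: str        # name of the agent to call (illustrative only)
    instruction: str  # template; {query} and {previous} are filled at run time


def toy_orchestrator(query: str) -> List[Step]:
    """Stand-in for a learned orchestrator: choose a scaffold from the query."""
    if "code" in query.lower():
        return [
            Step("coder", "Write a solution for: {query}"),
            Step("reviewer", "Review and fix this draft: {previous}"),
        ]
    return [Step("generalist", "Answer: {query}")]


def run_scaffold(query: str, steps: List[Step], agents: Dict[str, Agent]) -> str:
    """Run the steps in order, feeding each output into the next prompt."""
    previous = ""
    for step in steps:
        prompt = step.instruction.format(query=query, previous=previous)
        previous = agents[step.agent](prompt)
    return previous


# Stub agents so the sketch runs without calling any model API.
agents: Dict[str, Agent] = {
    "coder": lambda p: f"[draft] {p}",
    "reviewer": lambda p: f"[reviewed] {p}",
    "generalist": lambda p: f"[answer] {p}",
}
```

The point of the sketch is the separation of concerns: the orchestrator only decides the structure, and the agents do the work. Whether Fugu's scaffolds are sequential, branching, or something else is not stated in the abstract.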

If this is right

Teams of specialized LLMs can be orchestrated to reach higher performance than any one model alone.
The same training approach yields both a latency-balanced model and a higher-quality ultra variant.
Dynamic, query-adaptive scaffolds offer a route to collective intelligence without requiring a single larger model.
The infrastructure and design principles turn these methods into a working production system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Orchestration training might transfer to domains beyond the reported benchmarks if the scaffold generation generalizes.
Future systems could test whether the same approach improves when the underlying agent pool changes over time.
The method raises the question of how much of the gain comes from the choice of which agents to include versus how they are coordinated.

Load-bearing premise

The large-scale fine-tuning, evolutionary algorithms, and reinforcement learning produce orchestrators whose scaffolds deliver genuine performance gains rather than benchmark-specific optimizations or selection effects.

What would settle it

A controlled test in which Fugu's dynamic scaffold generation is replaced by a fixed coordination template and performance on the same benchmarks falls back to the level of the best single agent.
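One way such a controlled test could be scored is sketched below, assuming each system can be treated as a function from a task ID to pass or fail. The function names and the two "gain" quantities are illustrative choices, not anything defined in the paper.

```python
from typing import Callable, Dict, List

# Hypothetical system type: maps a task ID to True (solved) or False.
System = Callable[[str], bool]


def accuracy(system: System, tasks: List[str]) -> float:
    """Fraction of tasks the system solves."""
    return sum(system(t) for t in tasks) / len(tasks)


def ablation_report(
    tasks: List[str],
    dynamic: System,
    fixed: System,
    single_agents: Dict[str, System],
) -> Dict[str, float]:
    """Compare dynamic scaffolds, a fixed template, and the best single agent."""
    best_single = max(accuracy(a, tasks) for a in single_agents.values())
    dyn = accuracy(dynamic, tasks)
    fix = accuracy(fixed, tasks)
    return {
        "dynamic": dyn,
        "fixed_template": fix,
        "best_single": best_single,
        # Gain attributable to query-adaptive scaffold generation.
        "gain_from_adaptivity": dyn - fix,
        # Gain attributable to teaming alone, with coordination held fixed.
        "gain_from_teaming": fix - best_single,
    }
```

If `gain_from_adaptivity` is near zero while `gain_from_teaming` is large, the improvement would come from the agent pool rather than from the learned orchestration.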

Original abstract

The capabilities of frontier Large Language Models (LLMs) continue to advance, with different providers increasingly specializing in distinct domains. This raises a natural next objective: how to combine the individual specializations of various LLMs into a collectively intelligent system. To this end, we report the development of Sakana Fugu, a family of orchestrator models that harness and amplify the capabilities of an LLM agent team. Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. We release two models: Fugu, which balances performance with latency for everyday use, and Fugu-Ultra, which prioritizes answer quality on the hardest problems. We describe our training paradigm, which encompasses large-scale fine-tuning, evolutionary algorithms, and reinforcement learning approaches, along with the infrastructure and core design principles that turn these methods into a production system. We hope this report encourages further research into multi-agent systems and dynamic, query-adaptive agentic scaffolds as a path toward the next frontier of AI capabilities, accessed through collective intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Fugu claims SOTA on hard benchmarks through learned dynamic orchestration but supplies no evidence that evolutionary and RL stages avoided optimizing on those same tasks.

Fugu is a technical report on training orchestrator models that learn to build query-adaptive agent scaffolds from specialized LLMs. It claims this produces performance beyond any single agent and reaches SOTA on SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. The training uses large-scale fine-tuning plus evolutionary algorithms and reinforcement learning, and the authors release two models along with notes on infrastructure and design principles.

The model releases are the clearest positive. Access to the released models lets others run independent checks instead of relying on the report's numbers.

The training combination and production focus add an incremental practical angle on top of existing multi-agent orchestration ideas.

The main weakness is the missing training and evaluation detail. Nothing is said about the fitness function, the tasks used during evolution or RL, or whether the six reported benchmarks were held out from the search. If those benchmarks or close variants entered the evolutionary process, the SOTA results would be consistent with direct optimization rather than a general collective-intelligence mechanism. No ablations, error bars, or held-out comparisons appear either. The stress-test concern about search artifacts therefore applies directly to the information given.
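A first-pass held-out check of the kind the note asks for could look like the sketch below. It assumes the training prompts and benchmark prompts are available as plain strings, which the report does not confirm, and the helper names are hypothetical. It only catches matches that are identical up to case and whitespace; the "close variants" mentioned above would need fuzzy or semantic matching.

```python
import hashlib
import re
from typing import Dict, Iterable, List


def fingerprint(text: str) -> str:
    """Hash a prompt after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(
    train_prompts: Iterable[str],
    eval_sets: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return, per benchmark, the evaluation prompts that also appear in training."""
    train_prints = {fingerprint(p) for p in train_prompts}
    leaks: Dict[str, List[str]] = {}
    for name, prompts in eval_sets.items():
        hits = [p for p in prompts if fingerprint(p) in train_prints]
        if hits:
            leaks[name] = hits
    return leaks
```

An empty result from `find_leaks` is necessary but not sufficient evidence that a benchmark was held out.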

This is aimed at researchers and engineers already working on agentic systems who might want to download and test the released models. A reader looking for validated methods or reproducible gains will find the evidence too thin to rely on.

I would not bring this to reading group. I would not cite the performance claims. It does not deserve peer review until the methods section supplies the missing controls on training tasks and evaluation.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Sakana Fugu, a family of orchestrator language models trained via large-scale fine-tuning, evolutionary algorithms, and reinforcement learning. These models dynamically generate query-adaptive agentic scaffolds to combine the capabilities of multiple LLM agents, claiming state-of-the-art performance on SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. Two variants are released (Fugu for balanced latency/performance and Fugu-Ultra for maximum quality), along with descriptions of the training paradigm, infrastructure, and design principles for multi-agent collective intelligence.

Significance. If the central claims hold after verification that the reported benchmarks were held out from evolutionary and RL training, the work would be significant for demonstrating scalable collective intelligence through dynamic scaffolds rather than single-model scaling. The release of production-oriented models and the explicit call for further multi-agent research are positive contributions. However, the absence of methodological details on fitness functions, training tasks, ablations, and evaluation protocols prevents assessment of whether the results reflect genuine generalization.

major comments (2)

[Abstract] The claim that 'through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent' and achieves SOTA is load-bearing but unsupported by any description of the fitness function, reward model, or task distribution used in the evolutionary algorithms and reinforcement learning stages. Without this information it is impossible to rule out that the six listed benchmarks (or close variants) were included in the search process, which would make the results consistent with benchmark-specific optimization rather than the asserted collective-intelligence mechanism.
[Abstract] No ablation studies, error bars, baseline comparisons with the same underlying LLMs, or details on scaffold evaluation methodology are provided. These omissions directly affect the ability to evaluate whether the reported gains are attributable to the adaptive orchestrator or to unstated selection effects and hyperparameter tuning.

minor comments (1)

[Abstract] The manuscript states that 'we describe our training paradigm... along with the infrastructure and core design principles' but the provided text contains no such sections or technical specifications, making the production-system claims impossible to assess.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater methodological transparency to support the abstract claims. We agree that the current version would benefit from expanded details and will revise accordingly. Point-by-point responses follow.

Referee: [Abstract] The claim that 'through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent' and achieves SOTA is load-bearing but unsupported by any description of the fitness function, reward model, or task distribution used in the evolutionary algorithms and reinforcement learning stages. Without this information it is impossible to rule out that the six listed benchmarks (or close variants) were included in the search process, which would make the results consistent with benchmark-specific optimization rather than the asserted collective-intelligence mechanism.

Authors: We acknowledge that the abstract claims require supporting methodological details to rule out contamination. The manuscript describes the overall training paradigm at a high level but does not provide the requested specifics on fitness functions, reward models, or task distributions. In revision we will add a new subsection under Methods that details the fitness functions employed in evolutionary search, the reward models used in RL, and the construction of training task distributions. We will also explicitly state that the six evaluation benchmarks were held out from all stages of evolutionary algorithm search and RL training, enabling independent verification of the generalization claims.
revision: yes

Referee: [Abstract] No ablation studies, error bars, baseline comparisons with the same underlying LLMs, or details on scaffold evaluation methodology are provided. These omissions directly affect the ability to evaluate whether the reported gains are attributable to the adaptive orchestrator or to unstated selection effects and hyperparameter tuning.

Authors: We agree that the absence of these elements limits rigorous assessment of the orchestrator's contribution. In the revised manuscript we will add an 'Ablations and Analysis' section containing: (i) ablation studies that isolate the adaptive scaffold component, (ii) error bars computed over multiple independent evaluation runs, (iii) baseline comparisons that use identical underlying LLMs without the Fugu orchestrator, and (iv) a precise description of the scaffold evaluation protocol and metrics. These additions will directly address concerns about selection effects and hyperparameter tuning.
revision: yes
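For item (ii), a minimal sketch of how error bars over independent evaluation runs could be computed is given below, assuming each run yields a single benchmark score. The choice of standard error of the mean is an assumption made here for illustration; the rebuttal does not say which statistic would be reported.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_stderr(scores: List[float]) -> Tuple[float, float]:
    """Return the mean score and the standard error of the mean across runs."""
    if len(scores) < 2:
        raise ValueError("need at least two independent runs for an error bar")
    mean = statistics.fmean(scores)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr
```

With only a handful of runs the standard error is itself noisy, so reporting the number of runs alongside the interval matters as much as the interval.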

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

The provided abstract and description contain no equations, fitted parameters, or derivation steps that reduce by construction to the reported benchmark results. The training paradigm (fine-tuning + evolutionary algorithms + RL) is described at a high level without specifying fitness functions or the held-out status of the six evaluation benchmarks, and no self-citation bears on the central claim. No self-definitional loops, fitted-input predictions, or ansatz smuggling appear in the text. The SOTA claim is presented as an empirical outcome of the described system rather than a mathematical identity or renamed input, satisfying the requirement for independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all ledger entries are therefore empty.

[1] [1]

American Invitational Mathematics Examination, 2023 , year =

2023

[2] [2]

American Invitational Mathematics Examination, 2024 , year =

2024

[3] [3]

Deep learning with long short-term memory networks for financial market predictions , journal =

Thomas Fischer and Christopher Krauss , keywords =. Deep learning with long short-term memory networks for financial market predictions , journal =. 2018 , issn =. doi:https://doi.org/10.1016/j.ejor.2017.11.054 , url =

work page doi:10.1016/j.ejor.2017.11.054 2018

[4] [4]

Advances in Neural Information Processing Systems , volume=

Livecodebench pro: How do olympiad medalists judge llms in competitive programming? , author=. Advances in Neural Information Processing Systems , volume=

[5] [5]

arXiv preprint arXiv:2501.14249 , year=

Humanity's last exam , author=. arXiv preprint arXiv:2501.14249 , year=

Pith/arXiv arXiv

[6] [6]

arXiv preprint arXiv:2409.12640 , year=

Michelangelo: Long context evaluations beyond haystacks via latent structure queries , author=. arXiv preprint arXiv:2409.12640 , year=

arXiv

[7] [7]

2025 , publisher=

Artificial Analysis Long Context Reasoning Benchmark(LCR) , author=. 2025 , publisher=

2025

[8] [8]

2026 , month = apr, howpublished =

2026

[9] [9]

2026 , month = feb, howpublished =

2026

[10] [10]

2026 , month = jun, howpublished =

2026

[11] [11]

arXiv preprint arXiv:2509.16941 , year=

Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? , author=. arXiv preprint arXiv:2509.16941 , year=

Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2509.07968 , year=

Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge , author=. arXiv preprint arXiv:2509.07968 , year=

arXiv

[13] [13]

TODO -- pull from arXiv , journal =

[14] [14]

2026 , howpublished =

2026

[15] [15]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

[16] [16]

arXiv preprint arXiv:2411.04872 , year =

Glazer, Elliot and Erdil, Ege and Besiroglu, Tamay and Chicharro, Diego and Chen, Evan and Gunning, Alex and Olsson, Caroline Falkman and Denain, Jean-Stanislas and Ho, Anson and de Oliveira Santos, Emily and J. arXiv preprint arXiv:2411.04872 , year =

Pith/arXiv arXiv

[17] [17]

2026 , eprint=

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces , author=. 2026 , eprint=

2026

[18] [18]

2026 , month = may, howpublished =

2026

[19] [19]

Advances in Neural Information Processing Systems , volume=

Scicode: A research coding benchmark curated by scientists , author=. Advances in Neural Information Processing Systems , volume=

[20] [20]

Advances in Neural Information Processing Systems , volume=

Charxiv: Charting gaps in realistic chart understanding in multimodal llms , author=. Advances in Neural Information Processing Systems , volume=

[21] [21]

2022 , url =

Apple Stock Price from 1980-2021 , howpublished =. 2022 , url =

1980

[22] [22]

American Invitational Mathematics Examination, 2025 , year =

2025

[23] [23]

2026 , month = may, day =

An. 2026 , month = may, day =

2026

[24] [24]

2026 , month = jun, day =

Making. 2026 , month = jun, day =

2026

[25] [25]

2026 , howpublished =

What Is the. 2026 , howpublished =

2026

[26] [26]

Function Calling , year =

[27] [27]

2026 , howpublished =

Tool Use with. 2026 , howpublished =

2026

[28] [28]

2025 , month = jul, day =

Luong, Thang and Lockhart, Edward , title =. 2025 , month = jul, day =

2025

[29] [29]

2025 , eprint =

Patwardhan, Tejal and Dias, Rachel and Proehl, Elizabeth and Kim, Grace and Wang, Michele and Watkins, Olivia and Fishman, Sim. 2025 , eprint =

2025

[30] [30]

2026 , howpublished =

mini-swe-agent: The Minimal. 2026 , howpublished =

2026

[31] [31]

2026 , eprint=

Retrieval Augmented Conversational Recommendation with Reinforcement Learning , author=. 2026 , eprint=

2026

[32] [32]

Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM) , pages =

Wang-Cheng Kang and Julian McAuley , title =. Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM) , pages =. 2018 , publisher =

2018

[33] [33]

Terminus-2: Harbor's Reference Agent Implementation , year =

[34] [34]

Assessing

Carlini, Nicholas and Cheng, Newton and Lucas, Keane and Moore, Michael and Nasr, Milad and Prabhushankar, Vinay and Xiao, Winnie and Angulu, Hakeem and. Assessing. 2026 , month = apr, howpublished =

2026

[35] [35]

2025 , eprint =

Sequential Diagnosis with Language Models , author =. 2025 , eprint =

2025

[36] [36]

arXiv preprint arXiv:2503.04412 , year=

Wider or deeper? scaling llm inference-time compute with adaptive branching tree search , author=. arXiv preprint arXiv:2503.04412 , year=

arXiv

[37] [37]

International Conference on Learning Representations , volume=

Automated design of agentic systems , author=. International Conference on Learning Representations , volume=

[38] [38]

arXiv preprint arXiv:2210.03629 , year=

React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=

Pith/arXiv arXiv

[39] [39]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025

[40] Tool learning with large language models: A survey. Frontiers of Computer Science, 2025.

[41] Learning to Orchestrate Agents in Natural Language with the Conductor. arXiv preprint arXiv:2512.04388.

[42] TRINITY: An Evolved LLM Coordinator. arXiv preprint arXiv:2512.04695.

[43] AIDE: AI-driven exploration in the space of code. arXiv preprint arXiv:2502.13138.

[44] Automated design of agentic systems. arXiv preprint arXiv:2408.08435.

[45] LLM4AD: A platform for algorithm design with large language model. arXiv preprint arXiv:2412.17287.

[46] OpenEvolve: an open-source evolutionary coding agent. 2025.

[47] Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents. arXiv preprint arXiv:2505.22954.

[48] The AI CUDA engineer: Agentic CUDA kernel discovery, optimization and composition. 2025.

[49] ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering. arXiv preprint arXiv:2506.09050.

[50] AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.

[51] The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066.

[52] KernelBench: Can LLMs Write Efficient GPU Kernels? 2025.

[53] The AI Scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.

[54] Discovering Preference Optimization Algorithms with and for Large Language Models. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[55] The Llama 3 Herd of Models. 2024.

[56] Language Models are Few-Shot Learners. 2020.

[57] Discovering evolution strategies via meta-black-box optimization. Proceedings of the Companion Conference on Genetic and Evolutionary Computation.

[58] Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems.

[59] A systematic evaluation of large language models of code. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming.

[60] Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[61] Mathematical discoveries from program search with large language models. Nature, 2024.

[62] Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[63] Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305.

[64] Competition-level code generation with AlphaCode. Science, 2022.

[65] Layer normalization. arXiv preprint arXiv:1607.06450.

[66] Arbitrary style transfer in real-time with adaptive instance normalization. Proceedings of the IEEE International Conference on Computer Vision.

[67] Scalable parallel programming with CUDA: Is CUDA the parallel programming model that application developers have been waiting for? Queue, 2008.

[68] Programming massively parallel processors: a hands-on approach. 2016.

[69] Parallel computing experiences with CUDA. IEEE Micro, 2008.

[70] cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.

[71] TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

[72] JAX: composable transformations of Python+NumPy programs.

[73] PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems.

[74] Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.

[75] Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[76] Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.

[77] Compute trends across three eras of machine learning. 2022 International Joint Conference on Neural Networks (IJCNN), 2022.

[78] The effect of sampling temperature on problem solving in large language models. arXiv preprint arXiv:2402.05201.

[79] Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.

[80] On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.