Sakana Fugu Technical Report
Pith reviewed 2026-06-26 14:20 UTC · model grok-4.3
The pith
Fugu orchestrator models dynamically create scaffolds to coordinate LLM agent teams and exceed any single model's performance on hard tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning.
What carries the argument
Adaptive agentic scaffolds dynamically generated by the orchestrator models to harness and combine capabilities across an LLM agent team.
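As a rough illustration of what "query-adaptive scaffold" could mean in practice, the toy sketch below picks a different sequence of agent calls depending on the query. Everything in it is an assumption made for illustration: the names `Step`, `plan_scaffold`, `run_scaffold`, and `pool` are hypothetical, the keyword rule merely stands in for a trained orchestrator, and the stub agents replace real LLM calls. It is not the method described in the report, which gives no implementation details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM agent: maps a prompt string to a reply.
Agent = Callable[[str], str]


@dataclass
class Step:
    agent: str     # key into the agent pool
    template: str  # prompt template; {query} and {prev} are filled at run time


def plan_scaffold(query: str) -> List[Step]:
    """Toy planner: a keyword rule stands in for the trained orchestrator."""
    lowered = query.lower()
    if "bug" in lowered or "code" in lowered:
        return [
            Step("coder", "Write a patch for: {query}"),
            Step("reviewer", "Review this patch for '{query}': {prev}"),
        ]
    return [Step("generalist", "Answer: {query}")]


def run_scaffold(query: str, pool: Dict[str, Agent]) -> str:
    """Run the planned steps in order, passing each output to the next step."""
    prev = ""
    for step in plan_scaffold(query):
        prompt = step.template.format(query=query, prev=prev)
        prev = pool[step.agent](prompt)
    return prev


# Stub agents so the sketch runs without any model API.
pool: Dict[str, Agent] = {
    "coder": lambda p: f"[coder] {p}",
    "reviewer": lambda p: f"[reviewer] {p}",
    "generalist": lambda p: f"[generalist] {p}",
}

print(run_scaffold("Fix the bug in parse()", pool))
print(run_scaffold("What is the boiling point of water?", pool))
```

The point of the sketch is only the shape of the idea: the plan is a function of the query, so a coding request is routed through two agents while a factual question goes to one.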
If this is right
- Teams of specialized LLMs can be orchestrated to reach higher performance than any one model alone.
- The same training approach yields both a latency-balanced model and a higher-quality ultra variant.
- Dynamic, query-adaptive scaffolds offer a route to collective intelligence without requiring a single larger model.
- The infrastructure and design principles turn these methods into a working production system.
Where Pith is reading between the lines
- Orchestration training might transfer to domains beyond the reported benchmarks if the scaffold generation generalizes.
- Future systems could test whether the same approach improves when the underlying agent pool changes over time.
- The method raises the question of how much of the gain comes from the choice of which agents to include versus how they are coordinated.
Load-bearing premise
The large-scale fine-tuning, evolutionary algorithms, and reinforcement learning produce orchestrators whose scaffolds deliver genuine performance gains rather than benchmark-specific optimizations or selection effects.
What would settle it
A controlled test in which Fugu's dynamic scaffold generation is replaced by a fixed coordination template and performance on the same benchmarks falls back to the level of the best single agent.
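One way to score such a test is sketched below, assuming per-task pass/fail results are available for each single agent, for a fixed coordination template, and for the dynamic scaffold, all on the same task set. The function name `compare_conditions`, the condition labels, and the example data are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List


def compare_conditions(
    results: Dict[str, List[bool]],
    single_agents: List[str],
    fixed: str = "fixed_template",
    dynamic: str = "dynamic_scaffold",
) -> Dict[str, float]:
    """Compare accuracy of the best single agent, a fixed template, and dynamic scaffolds."""
    sizes = {len(v) for v in results.values()}
    if len(sizes) != 1:
        raise ValueError("all conditions must be scored on the same task set")
    acc = {name: sum(v) / len(v) for name, v in results.items()}
    best_single = max(acc[a] for a in single_agents)
    return {
        "best_single": best_single,
        # Gain from coordinating at all, with no query adaptation.
        "fixed_minus_best_single": acc[fixed] - best_single,
        # Gain attributable to generating the scaffold per query.
        "dynamic_minus_fixed": acc[dynamic] - acc[fixed],
    }


# Made-up placeholder outcomes on 10 tasks (True = solved); not real data.
example = {
    "agent_a": [True] * 5 + [False] * 5,
    "agent_b": [True] * 6 + [False] * 4,
    "fixed_template": [True] * 6 + [False] * 4,
    "dynamic_scaffold": [True] * 8 + [False] * 2,
}
report = compare_conditions(example, single_agents=["agent_a", "agent_b"])
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

Under this reading, the claim is supported when `dynamic_minus_fixed` is clearly positive and weakened when it is near zero while `fixed_minus_best_single` carries the gain.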
Original abstract
The capabilities of frontier Large Language Models (LLMs) continue to advance, with different providers increasingly specializing in distinct domains. This raises a natural next objective: how to combine the individual specializations of various LLMs into a collectively intelligent system. To this end, we report the development of Sakana Fugu, a family of orchestrator models that harness and amplify the capabilities of an LLM agent team. Fugu models are themselves language models trained to understand user queries and dynamically devise agentic scaffolds to solve them. Through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent, achieving state-of-the-art results compared to other publicly accessible models across a range of challenging tasks, including SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. We release two models: Fugu, which balances performance with latency for everyday use, and Fugu-Ultra, which prioritizes answer quality on the hardest problems. We describe our training paradigm, which encompasses large-scale fine-tuning, evolutionary algorithms, and reinforcement learning approaches, along with the infrastructure and core design principles that turn these methods into a production system. We hope this report encourages further research into multi-agent systems and dynamic, query-adaptive agentic scaffolds as a path toward the next frontier of AI capabilities, accessed through collective intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Sakana Fugu, a family of orchestrator language models trained via large-scale fine-tuning, evolutionary algorithms, and reinforcement learning. These models dynamically generate query-adaptive agentic scaffolds to combine the capabilities of multiple LLM agents, claiming state-of-the-art performance on SWE-Bench Pro, Terminal Bench, LiveCodeBench, GPQA-Diamond, Humanity's Last Exam, and CharXiv Reasoning. Two variants are released (Fugu for balanced latency/performance and Fugu-Ultra for maximum quality), along with descriptions of the training paradigm, infrastructure, and design principles for multi-agent collective intelligence.
Significance. If the central claims hold after verification that the reported benchmarks were held out from evolutionary and RL training, the work would be significant for demonstrating scalable collective intelligence through dynamic scaffolds rather than single-model scaling. The release of production-oriented models and the explicit call for further multi-agent research are positive contributions. However, the absence of methodological details on fitness functions, training tasks, ablations, and evaluation protocols prevents assessment of whether the results reflect genuine generalization.
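The held-out verification mentioned above could start with a simple overlap check between training tasks and benchmark items. The sketch below is a minimal version under stated assumptions: task statements are available as plain strings, and the names `fingerprint` and `overlap_report` and the example tasks are hypothetical. It catches only exact matches after light normalization, so it would miss paraphrases or close variants.

```python
import hashlib
import re
from typing import Dict, Iterable, Set


def fingerprint(text: str) -> str:
    """Hash a task statement after lowercasing and collapsing non-alphanumerics."""
    normalized = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_report(
    train_tasks: Iterable[str],
    benchmarks: Dict[str, Iterable[str]],
) -> Dict[str, int]:
    """Count, per benchmark, how many items also appear in the training tasks."""
    train: Set[str] = {fingerprint(t) for t in train_tasks}
    return {
        name: sum(fingerprint(t) in train for t in tasks)
        for name, tasks in benchmarks.items()
    }


# Placeholder task statements, invented for illustration only.
train = ["Reverse a linked list.", "Sum two integers"]
benchmarks = {
    "bench_a": ["reverse a  linked list", "Sort an array"],
    "bench_b": ["Prove the statement"],
}
print(overlap_report(train, benchmarks))
```

A zero count from a check like this is necessary but not sufficient evidence of a clean split; near-duplicate detection would need fuzzier matching.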
major comments (2)
- [Abstract] The claim that 'through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent' and achieves SOTA is load-bearing but unsupported by any description of the fitness function, reward model, or task distribution used in the evolutionary algorithms and reinforcement learning stages. Without this information it is impossible to rule out that the six listed benchmarks (or close variants) were included in the search process, which would make the results consistent with benchmark-specific optimization rather than the asserted collective-intelligence mechanism.
- [Abstract] No ablation studies, error bars, baseline comparisons with the same underlying LLMs, or details on scaffold evaluation methodology are provided. These omissions directly affect the ability to evaluate whether the reported gains are attributable to the adaptive orchestrator or to unstated selection effects and hyperparameter tuning.
minor comments (1)
- [Abstract] The manuscript states that 'we describe our training paradigm... along with the infrastructure and core design principles' but the provided text contains no such sections or technical specifications, making the production-system claims impossible to assess.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater methodological transparency to support the abstract claims. We agree that the current version would benefit from expanded details and will revise accordingly. Point-by-point responses follow.
- Referee: [Abstract] The claim that 'through these adaptive scaffolds, Fugu accesses performance beyond any individual LLM agent' and achieves SOTA is load-bearing but unsupported by any description of the fitness function, reward model, or task distribution used in the evolutionary algorithms and reinforcement learning stages. Without this information it is impossible to rule out that the six listed benchmarks (or close variants) were included in the search process, which would make the results consistent with benchmark-specific optimization rather than the asserted collective-intelligence mechanism.
  Authors: We acknowledge that the abstract claims require supporting methodological details to rule out contamination. The manuscript describes the overall training paradigm at a high level but does not provide the requested specifics on fitness functions, reward models, or task distributions. In revision we will add a new subsection under Methods that details the fitness functions employed in evolutionary search, the reward models used in RL, and the construction of training task distributions. We will also explicitly state that the six evaluation benchmarks were held out from all stages of evolutionary algorithm search and RL training, enabling independent verification of the generalization claims.
  Revision: yes
- Referee: [Abstract] No ablation studies, error bars, baseline comparisons with the same underlying LLMs, or details on scaffold evaluation methodology are provided. These omissions directly affect the ability to evaluate whether the reported gains are attributable to the adaptive orchestrator or to unstated selection effects and hyperparameter tuning.
  Authors: We agree that the absence of these elements limits rigorous assessment of the orchestrator's contribution. In the revised manuscript we will add an 'Ablations and Analysis' section containing: (i) ablation studies that isolate the adaptive scaffold component, (ii) error bars computed over multiple independent evaluation runs, (iii) baseline comparisons that use identical underlying LLMs without the Fugu orchestrator, and (iv) a precise description of the scaffold evaluation protocol and metrics. These additions will directly address concerns about selection effects and hyperparameter tuning.
  Revision: yes
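The error bars promised in item (ii) could be computed along the following lines. This is a minimal sketch assuming each independent evaluation run yields one aggregate score; the function name `mean_with_stderr` and the example scores are hypothetical placeholders, and the standard error of the mean is just one common choice of error bar.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def mean_with_stderr(run_scores: List[float]) -> Tuple[float, float]:
    """Return the mean score and its standard error across independent runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(run_scores), stdev(run_scores) / sqrt(len(run_scores))


# Placeholder scores from three hypothetical runs; not results from the paper.
runs = [0.60, 0.64, 0.62]
avg, stderr = mean_with_stderr(runs)
print(f"score = {avg:.3f} +/- {stderr:.3f} (standard error, n={len(runs)})")
```

With few runs the standard error is itself a noisy estimate, so reporting the number of runs alongside it, as the print line does, helps readers judge how much weight to give a small gap between systems.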
Circularity Check
No significant circularity; derivation self-contained against external benchmarks
The provided abstract and description contain no equations, fitted parameters, or derivation steps that reduce by construction to the reported benchmark results. The training paradigm (fine-tuning + evolutionary algorithms + RL) is described at a high level without specifying fitness functions, held-out status of the six evaluation benchmarks, or any self-citation that bears the central claim. No self-definitional loops, fitted-input predictions, or ansatz smuggling appear in the text. The SOTA claim is presented as an empirical outcome of the described system rather than a mathematical identity or renamed input, satisfying the requirement for independent content.