arxiv: 2605.09018 · v1 · submitted 2026-05-09 · 💻 cs.NE · cs.AI· cs.LG

Recognition: no theorem link

Evolutionary Ensemble of Agents

Liu Yang, Zongmin Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:59 UTC · model grok-4.3

classification 💻 cs.NE cs.AIcs.LG

keywords evolutionary ensemblecoding agentsco-evolutionagent adaptationin-context operator networksperformance ceilingsmulti-agent systems

0 comments

The pith

Organizing capable coding agents into a self-revising evolutionary ensemble lets them discover new mechanisms and surpass fixed performance ceilings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing high-performing coding agents do not need replacement or internal redesign to tackle harder tasks. Instead, the authors organize them into two co-evolving populations—one of functional code solvers and one of guidance states—that compete in synchronous races. Ratings update according to the marginal gains each agent contributes to the current solver, allowing the ensemble to revise its own behaviors stage by stage. Applied to In-Context Operator Networks, the process autonomously finds a rescale-then-interpolate rule that improves example-count generalization. Ablations show that without ongoing, stage-dependent adaptation the system encounters phase mismatches and cannot escape the ceilings reached by any single fixed agent.

Core claim

By maintaining two co-evolving populations of code solvers and agent guidance states, evaluating them through synchronous races, and updating empirical Elo ratings on the basis of marginal gains, the Evolutionary Ensemble framework autonomously discovers a robust rescale-then-interpolate mechanism that enables reliable example-count generalization in ICON; controlled ablations establish that stage-dependent agent adaptation is required to navigate shifting search landscapes and that the self-revising ensemble is the essential driver for exceeding static performance limits.

What carries the argument

The Evolutionary Ensemble (EvE) framework, which fixes the base agent substrate and evolves cumulative guidance and skills via two co-evolving populations evaluated in synchronous races with marginal-gain Elo rating updates.

Load-bearing premise

The synchronous race and marginal-gain Elo updates accurately capture and drive meaningful co-evolution without introducing artifacts from the evaluation setup or phase mismatch in the ICON task.

What would settle it

A direct comparison in which a fixed initial agent or a frozen best-evolved agent matches or exceeds EvE's ICON performance, discovers an equivalent rescale-then-interpolate mechanism, and shows no phase mismatch would falsify the necessity of the self-revising ensemble.

Figures

Figures reproduced from arXiv: 2605.09018 by Liu Yang, Zongmin Yu.

**Figure 2.** Figure 2: The EvE framework. EvE maintains two co-evolving populations: [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Search trajectories for all three variants (two independent runs each). [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 4.** Figure 4: Per-example-count error curves (k = 1 through k = 10) at two training budgets. Each variant contributes the best PE method from each of its two independent runs; the Seed (gray, ICON vanilla PE) is the reference baseline. See [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

**Figure 5.** Figure 5: Run 1: from agent guidance to solver code. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Run 2: from agent guidance to solver code, with the same layout [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 4.** Figure 4: error jumps from ∼0.05 (in-distribution) to ∼0.9 (out-of-distribution), an 18× degradation. The Seed demonstrates that positional-encoding design, not model capacity, is the bottleneck for example-count generalization. EvE Run 1: InterpolatedDemoPE Agent: evolved. The same PE family first appeared at iteration 2, but the best solver using this PE design was produced at iteration 15. Best at: iteration 15 (… view at source ↗

read the original abstract

We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the "LLMs as optimizers" paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen "best-evolved" agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvE's co-evolution of solvers and guidance states via synchronous races and marginal-gain Elo is a distinct framing for agent ensembles, but the abstract gives no numbers or update rules, so the necessity claims are hard to verify.

read the letter

The core idea is keeping base coding agents fixed while running two co-evolving populations: functional solvers and guidance states. They compete in synchronous races, score marginal gains with Elo updates, and apply this to the ICON task where it turns up a rescale-then-interpolate mechanism for better example-count generalization. The ablations against fixed-initial and frozen-best agents are meant to show that only the live ensemble avoids phase mismatch as the search landscape shifts.

Referee Report

3 major / 1 minor

Summary. The paper introduces Evolutionary Ensemble (EvE), a decentralized co-evolutionary framework that maintains two populations—functional code solvers and agent guidance states—evaluated via synchronous races with marginal-gain Elo rating updates. Applied to In-Context Operator Networks (ICON), EvE discovers a rescale-then-interpolate mechanism enabling example-count generalization. Controlled ablations against fixed-initial and frozen-best agents are claimed to demonstrate the absolute necessity of stage-dependent adaptation to avoid phase mismatch and break static performance ceilings in complex codebases.

Significance. If the ablations and discovery are substantiated with quantitative evidence, the work could advance evolutionary multi-agent systems for automated code and algorithm discovery by emphasizing co-evolution of behaviors over base-model changes. The focus on self-revising ensembles and the concrete ICON mechanism provide a falsifiable example of navigating shifting search landscapes, which is a strength if supported by reproducible experiments.

major comments (3)

[Abstract] Abstract: The central claim that 'controlled ablations reveal the absolute necessity of stage-dependent agent adaptation' is load-bearing for the paper's contribution, yet no quantitative results, performance metrics, error bars, or specific ablation outcomes (e.g., success rates or generalization scores) are reported to support this necessity versus fixed or frozen variants.
[Abstract] Abstract: The synchronous race and marginal-gain Elo updates are described without any equations, pseudocode, or formal definition of the rating update rule, race termination, or how 'current solver state' and 'marginal gains' are computed; this prevents verification that the ablations isolate true co-evolution rather than artifacts from the evaluation loop or ICON-specific phase mismatch.
[Abstract] Abstract: No details are given on the discovered 'robust rescale-then-interpolate mechanism,' including the search process that identified it, verification of its generalization across example counts, or direct comparisons to alternative mechanisms.

minor comments (1)

[Abstract] Abstract: The phrase 'phase mismatch' is introduced without a precise definition in the context of the ICON task or how it manifests in the synchronous race setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and have revised the abstract to incorporate quantitative support for the central claims, references to the formal definitions, and additional details on the discovered mechanism.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'controlled ablations reveal the absolute necessity of stage-dependent agent adaptation' is load-bearing for the paper's contribution, yet no quantitative results, performance metrics, error bars, or specific ablation outcomes (e.g., success rates or generalization scores) are reported to support this necessity versus fixed or frozen variants.

Authors: We agree that the abstract should more explicitly support the central claim with key quantitative results. The main manuscript presents these ablation outcomes in Section 5.2 (including success rates, generalization scores across example counts, and error bars from repeated runs) comparing EvE to the fixed-initial and frozen-best variants. We have revised the abstract to summarize the key metrics demonstrating the necessity of stage-dependent adaptation. revision: yes
Referee: [Abstract] Abstract: The synchronous race and marginal-gain Elo updates are described without any equations, pseudocode, or formal definition of the rating update rule, race termination, or how 'current solver state' and 'marginal gains' are computed; this prevents verification that the ablations isolate true co-evolution rather than artifacts from the evaluation loop or ICON-specific phase mismatch.

Authors: The synchronous race, marginal-gain Elo updates, rating update rule, race termination criteria, current solver state, and marginal gains are formally defined in Section 3.2 with Equations (1)–(3) and Algorithm 1. The abstract is length-constrained, so we have added a sentence referencing these definitions and the ablation controls that isolate co-evolution effects. revision: partial
Referee: [Abstract] Abstract: No details are given on the discovered 'robust rescale-then-interpolate mechanism,' including the search process that identified it, verification of its generalization across example counts, or direct comparisons to alternative mechanisms.

Authors: The rescale-then-interpolate mechanism, the co-evolutionary search process that identified it, verification of generalization across example counts (1–10), and comparisons to alternatives are detailed in Section 4.3 with supporting experiments. We have revised the abstract to briefly describe the mechanism and its generalization properties. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external task performance and ablations

full rationale

The abstract describes an evolutionary framework using internal Elo ratings based on marginal gains, but the load-bearing results are the autonomous discovery of a rescale-then-interpolate mechanism for example-count generalization in ICON and controlled ablations versus fixed/frozen agents. No equations, self-citations, or self-definitional reductions are present in the provided text. Task success (generalization) and ablation comparisons appear measured against independent benchmarks rather than reducing to the framework's own definitions by construction. Per hard rules, absence of quotable reduction to inputs yields score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the central claims rest on the assumption that Elo ratings derived from marginal gains in a synchronous race provide a reliable fitness signal for co-evolution; no explicit free parameters, axioms, or invented entities are detailed.

pith-pipeline@v0.9.0 · 5484 in / 1133 out tokens · 25932 ms · 2026-05-12T01:59:41.070719+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 4 internal anchors

[1]

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

Liu, Ziyang and Guo, Xinyan and Wei, Xuchen and Hao, Han and Yang, Liu , title =. arXiv preprint arXiv:2604.23472 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[2]

1992 , publisher =

Genetic Programming: On the Programming of Computers by Means of Natural Selection , author =. 1992 , publisher =

work page 1992
[3]

and Le, Quoc V

Real, Esteban and Liang, Chen and So, David R. and Le, Quoc V. , booktitle =. 2020 , publisher =

work page 2020
[4]

Handbook of Evolutionary Machine Learning , series =

Evolution Through Large Models , author =. Handbook of Evolutionary Machine Learning , series =. 2023 , publisher =

work page 2023
[5]

Nature , volume =

Mathematical Discoveries From Program Search With Large Language Models , author =. Nature , volume =. 2024 , doi =

work page 2024
[6]

Proceedings of the 41st International Conference on Machine Learning , year =

Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language Model , author =. Proceedings of the 41st International Conference on Machine Learning , year =

work page
[7]

Novikov, Alexander and Vu, Ngan and Eisenberger, Marvin and Dupont, Emilien and Huang, Po-Sen and Wagner, Adam Zsolt and Shirobokov, Sergey and Kozlovskii, Borislav and Ruiz, Francisco J. R. and Mehrabian, Abbas and Kumar, M. Pawan and See, Abigail and Chaudhuri, Swarat and Holland, George and Davies, Alex and Nowozin, Sebastian and Kohli, Pushmeet and Ba...

work page
[8]

2026 , howpublished =

work page 2026
[9]

Codeevolve: An open source evolutionary coding agent for algorithm discovery and optimization.arXiv preprint arXiv:2510.14150, 2025

Assump. arXiv preprint arXiv:2510.14150 , year =

work page arXiv
[10]

T., Imajuku, Y., and Cetin, E

Lange, Robert Tjarko and Imajuku, Yuki and Cetin, Edoardo , year =. 2509.19349 , archivePrefix =

work page arXiv
[11]

Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025

Zhang, Jenny and Hu, Shengran and Lu, Cong and Lange, Robert and Clune, Jeff , year =. Darwin G. 2505.22954 , archivePrefix =

work page arXiv
[12]

Huxley-G

Wang, Wenyi and Pi. Huxley-G. 2025 , eprint =

work page 2025
[13]

2026 , eprint =

Hyperagents , author =. 2026 , eprint =

work page 2026
[14]

2026 , eprint =

Group-Evolving Agents: Open-Ended Self-Improvement via Experience Sharing , author =. 2026 , eprint =

work page 2026
[15]

Coral: Towards autonomous multi-agent evolution for open-ended discovery.arXiv preprint arXiv:2604.01658, 2026

Qu, Ao and Zheng, Han and Zhou, Zijian and Yan, Yihao and Tang, Yihong and Ong, Shao Yong and Hong, Fenglu and Zhou, Kaichen and Jiang, Chonghe and Kong, Minwei and Zhu, Jiacheng and Jiang, Xuan and Li, Sirui and Wu, Cathy and Low, Bryan Kian Hsiang and Zhao, Jinhua and Liang, Paul Pu , year =. 2604.01658 , archivePrefix =

work page arXiv
[16]

Proceedings of the 41st International Conference on Machine Learning , series =

Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution , author =. Proceedings of the 41st International Conference on Machine Learning , series =. 2024 , eprint =

work page 2024
[17]

Large Language Models as Optimizers

Large Language Models as Optimizers , author =. International Conference on Learning Representations (ICLR) , year =. 2309.03409 , archivePrefix =

work page internal anchor Pith review arXiv
[18]

2025 , eprint =

A Self-Improving Coding Agent , author =. 2025 , eprint =

work page 2025
[19]

Physica D: Nonlinear Phenomena , volume =

Co-Evolving Parasites Improve Simulated Evolution as an Optimization Procedure , author =. Physica D: Nonlinear Phenomena , volume =. 1990 , doi =

work page 1990
[20]

Evolutionary Computation , volume =

New Methods for Competitive Coevolution , author =. Evolutionary Computation , volume =. 1997 , doi =

work page 1997
[21]

Artificial Life , volume =

Coevolutionary Computation , author =. Artificial Life , volume =. 1995 , doi =

work page 1995
[22]

Proceedings of the National Academy of Sciences , volume =

In-Context Operator Learning With Data Prompts for Differential Equation Problems , author =. Proceedings of the National Academy of Sciences , volume =. 2023 , doi =

work page 2023
[23]

, journal =

Yang, Liu and Osher, Stanley J. , journal =. 2024 , doi =

work page 2024
[24]

, journal =

Cao, Yadi and Liu, Yuxuan and Yang, Liu and Yu, Rose and Schaeffer, Hayden and Osher, Stanley J. , journal =

work page
[25]

Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction

Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction , author =. arXiv preprint arXiv:2603.12725 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Neural Networks , year =

Fine-Tune Language Models as Multi-Modal Differential Equation Solvers , author =. Neural Networks , year =

work page
[27]

2025 , eprint =

Probabilistic Operator Learning: Generative Modeling and Uncertainty Quantification for Foundation Models of Differential Equations , author =. 2025 , eprint =

work page 2025
[28]

NeurIPS 2023 Workshop on the Symbiosis of Deep Learning and Differential Equations (DLDE III) , year =

Does In-Context Operator Learning Generalize to Domain-Shifted Settings? , author =. NeurIPS 2023 Workshop on the Symbiosis of Deep Learning and Differential Equations (DLDE III) , year =

work page 2023
[29]

2024 , eprint =

In-Context Learning of Linear Systems: Generalization Theory and Applications to Operator Learning , author =. 2024 , eprint =

work page 2024
[30]

2025 , eprint =

Continuum Transformers Perform In-Context Learning by Operator Gradient Descent , author =. 2025 , eprint =

work page 2025
[31]

2025 , eprint =

Solving Optimal Execution Problems via In-Context Operator Networks , author =. 2025 , eprint =

work page 2025
[32]

2026 , eprint =

In-Context Operator Learning on the Space of Probability Measures , author =. 2026 , eprint =

work page 2026
[33]

CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification

Zhang, Hanrong and Fan, Shichen and Zou, Henry Peng and Chen, Yankai and Wang, Zhenting and Zhou, Jiayuan and Li, Chengze and Huang, Wei-Chieh and Yao, Yifei and Zheng, Kening and Liu, Xue and Li, Xiaoxiao and Yu, Philip S. , title =. arXiv preprint arXiv:2604.01687 , year =

work page internal anchor Pith review Pith/arXiv arXiv