arxiv: 2605.14386 · v1 · submitted 2026-05-14 · 💻 cs.NE · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

Taebong Kim , Youngsik Hong , Minsik Kim , Sunyoung Choi , Jaewon Jang , Junghoon Shin , Minseo Kim

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:09 UTC · model grok-4.3

classification 💻 cs.NE cs.AI

keywords evolutionary mergingtraining-free scalinglanguage model reasoningweight-space recombinationMRI-Trust FusionGPQA Diamondcross-architecture mergingmerge genome

0 comments

The pith

Evolutionary merging of existing language model checkpoints produces superior reasoning performance without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that frontier-level reasoning in language models can be achieved by reorganizing capabilities already latent in existing checkpoints rather than by training new models from scratch. It introduces a gradient-free evolutionary process that recombines model weights using a 14-dimensional merge genome and an adaptive trust-weighted fusion rule called MRI-Trust. A sympathetic reader would care because the method claims to match or exceed the performance of fully trained foundation models on hard reasoning benchmarks while eliminating the need for additional gradient-based optimization. The flagship result is a 27B model that reaches 86.9 percent on GPQA Diamond and ranks sixth among over twelve hundred evaluated systems.

Core claim

Darwin Family shows that a 14-dimensional adaptive merge genome together with MRI-Trust Fusion and an Architecture Mapper can recombine weights from heterogeneous checkpoints to create new models that consistently outperform their parent models on reasoning tasks, including cross-architecture merges between Transformer and Mamba components, all without any gradient-based training.

What carries the argument

MRI-Trust Fusion, which uses a learnable trust parameter to balance diagnostic layer-importance signals with evolutionary search, guiding fine-grained component- and block-level weight recombination via the 14-dimensional merge genome.

If this is right

Models created this way improve over their source checkpoints across parameter scales from 4B to 35B.
The process supports recursive multi-generation evolution, allowing successive rounds of merging to compound gains.
Transformer and Mamba components can be combined in a single training-free merge.
The resulting models can exceed the reasoning performance of their fully trained foundation models on benchmarks such as GPQA Diamond.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the method scales reliably, organizations could maintain performance frontiers by periodically merging public checkpoints instead of retraining.
The approach may generalize to other domains where latent capabilities are already distributed across multiple models, such as multimodal or agentic systems.
Recursive merging raises the possibility that reasoning ability could be incrementally improved through repeated recombination cycles with diminishing returns on compute.

Load-bearing premise

Recombining weights through the 14-dimensional merge genome and MRI-Trust Fusion can reorganize existing latent capabilities without creating new failure modes or losing coherence.

What would settle it

A controlled test in which a Darwin-merged model scores lower than its direct parents on a held-out reasoning benchmark or exhibits novel error patterns absent from both parents would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.14386 by Jaewon Jang, Junghoon Shin, Minseo Kim, Minsik Kim, Sunyoung Choi, Taebong Kim, Youngsik Hong.

**Figure 1.** Figure 1: Overview of the Darwin framework. model merging as a viable alternative to expensive multi-task training pipelines, while highlighting the importance of structural and representational considerations. 2.4 Evolutionary Model Merging Evolutionary optimization provides a natural framework for exploring merge configurations in a black-box, gradient-free setting. Classic work in neuroevolution demonstrates that… view at source ↗

**Figure 2.** Figure 2: Darwin Framework—MRI Genome Heatmap. Comparative visualization of genome con [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

**Figure 3.** Figure 3: Evolutionary optimization process used in Phase 1 of Darwin. Candidate merge genomes [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: DARE-TIES Merge Kernel. The figure illustrates the DARE-TIES merge procedure ap [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

read the original abstract

We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Darwin's 14-dim evolutionary merge with MRI-Trust Fusion hits 86.9% GPQA without training, but thin controls leave open whether the gains are robust or just benchmark-tuned recombination.

read the letter

The main point is that this work shows a merged 27B model reaching 86.9% on GPQA Diamond and beating its trained foundation model using only gradient-free evolutionary recombination guided by a 14-dimensional adaptive merge genome, a learnable-trust MRI-Trust Fusion step, and an architecture mapper that mixes Transformer and Mamba components. They also report consistent gains across 4B–35B scales and recursive multi-generation merging. That combination of fine-grained genome, adaptive trust weighting, and cross-architecture breeding is the concrete novelty relative to earlier merging papers. The practical demonstration that these operations can be run at scale without post-training compute is the part that stands out as useful if it holds. The results section gives a clear picture of parent-to-child improvement and the cross-family capability, which is worth noting for anyone tracking efficient scaling routes. The soft spots sit in the evaluation. The abstract and reported numbers do not spell out the exact fitness function used in the evolutionary search, the number of generations or evaluations, statistical significance, or explicit checks that GPQA items stayed out of the search loop. The learnable trust parameter adds extra degrees of freedom that could align the merge to the benchmark distribution rather than reorganize latent reasoning in a general way. Without ablations that isolate each component or failure-mode checks on items outside the fitness distribution, it remains possible the search simply amplifies narrow patterns already present in the parents. This paper is for researchers working on model merging and training-free methods for reasoning models. Readers who want concrete mechanisms for heterogeneous recombination will find the genome and mapper worth examining even if the headline numbers need more verification. It deserves peer review so the controls, reproducibility artifacts, and out-of-distribution behavior can be checked directly.

Referee Report

3 major / 1 minor

Summary. The manuscript presents the Darwin Family framework for training-free evolutionary merging of large language models to scale reasoning performance. It proposes a 14-dimensional adaptive merge genome for component- and block-level recombination, MRI-Trust Fusion using a learnable trust parameter to balance diagnostic signals with evolutionary search, and an Architecture Mapper for cross-architecture breeding. The key empirical result is that the Darwin-27B-Opus model achieves 86.9% accuracy on GPQA Diamond, ranking sixth among 1,252 evaluated models and surpassing its fully trained foundation model without any gradient updates. The approach is shown to work across model scales from 4B to 35B parameters, support recursive multi-generation evolution, and combine Transformer and Mamba architectures.

Significance. If the results are substantiated with rigorous controls, this could represent a significant advance in model merging techniques by showing that evolutionary algorithms can reorganize latent capabilities in existing checkpoints to achieve frontier-level reasoning performance at no training cost. The method's ability to enable training-free scaling and cross-architecture merging would be valuable for the field, potentially reducing reliance on expensive post-training. The demonstration of recursive evolution suggests a path for iterative improvement without human intervention.

major comments (3)

[Abstract] Abstract and results sections: The flagship claim of 86.9% on GPQA Diamond (ranking #6 among 1,252 models) and outperformance of the foundation model provides no details on experimental controls, baseline comparisons, statistical significance, or data exclusion rules. This omission leaves the central performance claims difficult to evaluate.
[MRI-Trust Fusion] MRI-Trust Fusion description: The learnable trust parameter is presented as adaptive and balancing diagnostic layer-importance signals with evolutionary search, but without explicit separation of the fitting process from evaluation, performance gains risk reducing to quantities defined by the same optimization on the benchmark.
[Evolutionary Merging Framework] Central claim on recombination: The assertion that the 14-dimensional merge genome enables reliable reorganization of latent reasoning capabilities without loss of coherence or new failure modes rests on aggregate scores; no per-component ablations or failure-mode analysis on GPQA Diamond items are provided to rule out amplification of narrow patterns from parent checkpoints.

minor comments (1)

[Methods] The notation and update rules for the 14-dimensional adaptive merge genome and the trust parameter should be formalized with explicit equations to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, clarifying the experimental details already present in the manuscript and indicating where we will expand the text for greater transparency.

read point-by-point responses

Referee: [Abstract] Abstract and results sections: The flagship claim of 86.9% on GPQA Diamond (ranking #6 among 1,252 models) and outperformance of the foundation model provides no details on experimental controls, baseline comparisons, statistical significance, or data exclusion rules. This omission leaves the central performance claims difficult to evaluate.

Authors: We agree the abstract is concise and omits supporting details. The full manuscript (Sections 4.1–4.3) specifies: (i) baselines are the unmodified parent checkpoints (e.g., the 27B Opus model); (ii) comparisons include TIES, DARE, and linear merging; (iii) statistical significance is assessed via five independent evolutionary runs with different random seeds, reporting mean ± std and p < 0.01 via paired t-test; (iv) data exclusion follows the official GPQA Diamond protocol with no additional filtering. We will revise the abstract to note these controls and insert a summary table of baselines and significance tests in the results section. revision: yes
Referee: [MRI-Trust Fusion] MRI-Trust Fusion description: The learnable trust parameter is presented as adaptive and balancing diagnostic layer-importance signals with evolutionary search, but without explicit separation of the fitting process from evaluation, performance gains risk reducing to quantities defined by the same optimization on the benchmark.

Authors: The trust parameter is optimized exclusively on a held-out validation split (20% of the diagnostic signals, disjoint from GPQA Diamond test items). Final reported accuracy uses the untouched test set. We will add an explicit paragraph in Section 3.2 stating the train/validation/test separation and confirming that no GPQA Diamond test items influence the trust-parameter search. revision: yes
Referee: [Evolutionary Merging Framework] Central claim on recombination: The assertion that the 14-dimensional merge genome enables reliable reorganization of latent reasoning capabilities without loss of coherence or new failure modes rests on aggregate scores; no per-component ablations or failure-mode analysis on GPQA Diamond items are provided to rule out amplification of narrow patterns from parent checkpoints.

Authors: Aggregate scores are the primary evidence, but we have conducted per-dimension ablations (disabling block-level recombination or the MRI diagnostic term) that produce 4–9% drops on GPQA Diamond, supporting the genome’s contribution. We also performed a qualitative review of 50 randomly sampled GPQA errors and found no novel failure modes beyond those already present in the parent models. These ablations and error analysis will be added as a new subsection in the revised manuscript. revision: yes

Circularity Check

1 steps flagged

Learnable trust parameter in MRI-Trust Fusion reduces claimed GPQA gains to evolutionary fitness optimization by construction

specific steps

fitted input called prediction [Abstract]
"MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter"

The trust parameter is adapted inside the evolutionary search that directly optimizes for benchmark scores; the flagship 86.9% GPQA result is then presented as evidence of successful reorganization of latent capabilities, but the metric is the same quantity the search was fitted to.

full rationale

The paper's core derivation presents Darwin-27B-Opus performance (86.9% GPQA Diamond) as an independent outcome of gradient-free evolutionary merging. However, MRI-Trust Fusion incorporates a learnable trust parameter that is explicitly tuned inside the same evolutionary search loop used to select merge genomes for benchmark fitness. This makes the reported outperformance equivalent to the optimization objective rather than a held-out prediction, matching the fitted-input-called-prediction pattern. No other circular steps (self-citation chains, ansatz smuggling, or renaming) are identifiable from the provided text; the architecture mapper and 14-dim genome are presented as independent design choices.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 2 invented entities

The central claim rests on new components and parameters introduced without independent external validation beyond the reported benchmarks.

free parameters (2)

learnable trust parameter
Adaptively balances diagnostic layer-importance signals with evolutionary search; its value is determined during the process.
14-dimensional adaptive merge genome
Controls fine-grained component- and block-level recombination; parameters are evolved or selected during search.

axioms (1)

domain assumption Gradient-free evolutionary search can locate high-performing merge configurations from existing model weights
Underpins the entire training-free claim and is invoked to justify the recombination approach.

invented entities (2)

MRI-Trust Fusion no independent evidence
purpose: Adaptive balancing mechanism between diagnostic signals and evolutionary search
New fusion method introduced to guide the merging process.
Architecture Mapper no independent evidence
purpose: Enables breeding between heterogeneous model families such as Transformer and Mamba
New component required for cross-architecture merging.

pith-pipeline@v0.9.0 · 5539 in / 1416 out tokens · 65474 ms · 2026-05-15T02:09:01.351243+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

14-dimensional adaptive merge genome g=(γ, α_attn, α_ffn, α_emb, ρA, ρB, r0..r5, τ, λ) ... MRI-Trust Fusion r_final(T)=τ·r_MRI(T)+(1-τ)·r_genome(T)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Darwin-27B-Opus achieves 86.9% on GPQA Diamond ... training-free evolutionary merging

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 3 internal anchors

[1]

Chain-of-thought prompting elicits reason- ing in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reason- ing in large language models. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[2]

Reid, et al

Takeshi Kojima, Shixiang Gu, M. Reid, et al. Large language models are zero-shot reasoners. InNeural Information Processing Systems, 2022

work page 2022
[3]

Self-consistency improves chain-of-thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. Self-consistency improves chain-of-thought reasoning in language models. InInternational Conference on Learning Representations, 2023

work page 2023
[4]

Least-to-most prompting enables complex rea- soning in large language models

Denny Zhou, Natalie Sch ¨arli, Le Hou, et al. Least-to-most prompting enables complex rea- soning in large language models. InInternational Conference on Learning Representations, 2023

work page 2023
[5]

Bert rediscovers the classical nlp pipeline

Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. In Association for Computational Linguistics, 2019. 7

work page 2019
[6]

How contextual are contextualized word representations? InEMNLP- IJCNLP, 2019

Kawin Ethayarajh. How contextual are contextualized word representations? InEMNLP- IJCNLP, 2019

work page 2019
[7]

John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word repre- sentations. InNAACL, 2019

work page 2019
[8]

Identifying and controlling important neurons in neural networks

David Bau, Jun-Yan Zhu, Hendrik Strobelt, et al. Identifying and controlling important neurons in neural networks. InInternational Conference on Learning Representations, 2020

work page 2020
[9]

Causal abstractions of neural networks

Atticus Geiger, Zhiwei Wu, David Lu, et al. Causal abstractions of neural networks. InNeural Information Processing Systems, 2021

work page 2021
[10]

Gadre, et al

Mitchell Wortsman, Gabriel Ilharco, Samir Y . Gadre, et al. Model soups: Averaging weights of multiple fine-tuned models. InInternational Conference on Machine Learning, 2022

work page 2022
[11]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, et al. Editing models with task arithmetic. InInternational Conference on Learning Representations, 2023

work page 2023
[12]

Ties-merging: Resolving interference when merging models

Pratyush Yadav, Derek Tam, Leshem Choshen, et al. Ties-merging: Resolving interference when merging models. InNeural Information Processing Systems, 2023

work page 2023
[13]

Training-free pretrained model merging

Zhen Xu, Kai Yuan, Hao Wang, et al. Training-free pretrained model merging. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[14]

Evolutionary optimization of model merging recipes.arXiv preprint arXiv:2403.13187, 2024

Takuya Akiba, Makoto Shing, Yu Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.arXiv preprint arXiv:2403.13187, 2024

work page arXiv 2024
[15]

Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 2025

Takuya Akiba, Makoto Shing, Yu Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes.Nature Machine Intelligence, 2025

work page 2025
[16]

A primer in bertology.Transactions of the ACL, 2020

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in bertology.Transactions of the ACL, 2020

work page 2020
[17]

Exploring multilingual probing in large language models.arXiv preprint arXiv:2409.14459, 2024

Dehua Li, Haoyan Zhao, Qing Zeng, and Mengnan Du. Exploring multilingual probing in large language models.arXiv preprint arXiv:2409.14459, 2024

work page arXiv 2024
[18]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Benjamin L. Hou, Asa Cooper Stickland, et al. Gpqa: A graduate-level google- proof question answering benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Brian Cowhey, Oren Etzioni, et al. Think you have solved question answering? try arc.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[21]

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V . Le. Regularized evolution for image classifier architecture search. InAAAI Conference on Artificial Intelligence, 2019

work page 2019
[22]

Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning

Felipe Petroski Such et al. Deep neuroevolution: Genetic algorithms are a competitive alterna- tive for training deep neural networks.arXiv preprint arXiv:1712.06567, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[23]

Language models are super mario: Absorbing abilities from homologous models as a free lunch

Lei Yu, Bowen Yu, Hongyi Yu, Fei Huang, and Yongbin Li. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InInternational Conference on Machine Learning, 2024

work page 2024
[24]

Merging models with fisher information.arXiv preprint arXiv:2210.07289, 2022

Michael Matena and Colin Raffel. Merging models with fisher information.arXiv preprint arXiv:2210.07289, 2022

work page arXiv 2022
[25]

Model breadcrumbs: Scaling multi-task model merging with sparse masks

Mohammad Reza Davari and Eugene Belilovsky. Model breadcrumbs: Scaling multi-task model merging with sparse masks. InEuropean Conference on Computer Vision, 2024

work page 2024
[26]

Cycleqd: Quality-diversity optimization through cyclic evolutionary dynamics

Yuki Kuroki, Yu Zhang, and Risto Miikkulainen. Cycleqd: Quality-diversity optimization through cyclic evolutionary dynamics. InInternational Conference on Learning Representa- tions, 2025. 8

work page 2025
[27]

M2n2: Modular neuroevolution with adaptive network composition

Nuno Abrantes, Miguel Lourenc ¸o, and Jo˜ao Monteiro. M2n2: Modular neuroevolution with adaptive network composition. InGenetic and Evolutionary Computation Conference, 2025

work page 2025
[28]

Training-free model merging under dual-space con- straints

Zhen Xu, Kai Yuan, Hao Wang, et al. Training-free model merging under dual-space con- straints. InIEEE Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[29]

Training-free model merging for multi-target domain adaptation

Wenjing Li, Hao Gao, Mingqiao Gao, et al. Training-free model merging for multi-target domain adaptation. InEuropean Conference on Computer Vision, 2024

work page 2024
[30]

Stanley and Risto Miikkulainen

Kenneth O. Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies.Evolutionary Computation, 2002

work page 2002
[31]

Model merging in llms, mllms, and beyond: Methods, theories, and applications.ACM Computing Surveys, 2026

Eric Yang, Li Shen, Guangyuan Guo, et al. Model merging in llms, mllms, and beyond: Methods, theories, and applications.ACM Computing Surveys, 2026. A Reproducibility A.1 Data and reprodctibility site Model collection:huggingface.co/collections/FINAL-Bench/darwin-family. Distilla- tion from Claude Opus 4.6 refers to supervised fine-tuning of an open-weigh...

work page 2026