arxiv: 2512.04695 · v3 · submitted 2025-12-04 · 💻 cs.LG

TRINITY: An Evolved LLM Coordinator

Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang

Pith reviewed 2026-05-17 01:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords: LLM coordination, multi-model collaboration, evolutionary optimization, agentic systems, benchmark performance, role delegation

The pith

A small evolved coordinator directs multiple LLMs by assigning Thinker, Worker, or Verifier roles across turns, beating single models and prior methods on coding, math, and reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Trinity as a lightweight system that combines several large language models without merging their weights or accessing their internals. A compact coordinator model, roughly 0.6 billion parameters plus a tiny head, learns through an evolutionary strategy to decide which model handles each step of a task. By breaking queries into multiple turns and delegating roles, the approach shifts skill learning away from the coordinator itself. Experiments report consistent gains over individual models and existing combination techniques on standard coding, math, reasoning, and knowledge benchmarks, including 86.2 percent on LiveCodeBench, with strong generalization to new distributions.
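
The multi-turn delegation described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_trinity_style_loop`, `ROLE_PROMPTS`, `toy_policy`, the stub LLM callables, and the rule of stopping once a Verifier replies "ACCEPT" are all hypothetical. The source only states that, at each turn, the coordinator picks an LLM and one of three roles, and that a maximum turn budget exists.

```python
from typing import Callable, Dict, List, Tuple

ROLES = ("Thinker", "Worker", "Verifier")

# Hypothetical role-specific prompts; the paper's actual prompts are not given here.
ROLE_PROMPTS: Dict[str, str] = {
    "Thinker": "Analyze the problem and propose a plan.",
    "Worker": "Carry out the plan and produce an answer.",
    "Verifier": "Check the latest answer. Reply ACCEPT or REJECT with reasons.",
}

LLM = Callable[[str], str]                       # prompt -> reply
Policy = Callable[[List[str]], Tuple[int, str]]  # transcript -> (LLM index, role)


def run_trinity_style_loop(
    query: str, llms: List[LLM], coordinator_policy: Policy, max_turns: int = 5
) -> List[str]:
    """Run a coordinator-driven multi-turn loop and return the transcript."""
    transcript = [f"User: {query}"]
    for _ in range(max_turns):
        # The coordinator sees the full transcript and picks an LLM and a role.
        llm_idx, role = coordinator_policy(transcript)
        # Message processing: append a role-specific instruction to the transcript.
        prompt = "\n".join(transcript) + f"\n[{role}] {ROLE_PROMPTS[role]}"
        reply = llms[llm_idx](prompt)
        transcript.append(f"{role} (model {llm_idx}): {reply}")
        # Assumed stopping rule (not taken from the source).
        if role == "Verifier" and reply.strip().upper().startswith("ACCEPT"):
            break
    return transcript


def toy_policy(transcript: List[str]) -> Tuple[int, str]:
    """Stand-in for the learned coordinator: cycle through roles and models."""
    turn = len(transcript) - 1
    return turn % 2, ROLES[turn % 3]


# Stub "LLMs" so the sketch runs without any API access.
toy_llms: List[LLM] = [
    lambda p: "ACCEPT: looks right" if "[Verifier]" in p else "draft answer",
    lambda p: "42",
]

transcript = run_trinity_style_loop("What is 6 * 7?", toy_llms, toy_policy)
for line in transcript:
    print(line)
```

In the paper the policy is learned rather than hand-written; the toy policy only shows where that decision plugs into the loop.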

Core claim

Trinity achieves state-of-the-art results by using a compact language model and lightweight head as coordinator, optimized with separable Covariance Matrix Adaptation Evolution Strategy. The coordinator assigns one of three roles—Thinker, Worker, or Verifier—to a chosen LLM at each turn. Performance stems from the coordinator's hidden-state representations supplying rich input contextualization and from the evolutionary method exploiting block-epsilon-separability to surpass reinforcement learning, imitation learning, and random search under high dimensionality and limited budget.

What carries the argument

The lightweight coordinator (0.6B-parameter model plus 10K-parameter head) that processes queries over multiple turns and delegates roles to selected LLMs, trained via separable Covariance Matrix Adaptation Evolution Strategy.
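
As a rough picture of how such a head could turn one hidden state into a decision, here is a minimal NumPy sketch. The sizes are assumptions chosen for illustration: a hidden dimension of 1024 (not stated in this summary), seven candidate LLMs (matching agents A0 to A6 in the Figure 15 caption), a bias-free linear map, and greedy argmax selection. Under those assumptions the weight matrix has 10 × 1024 = 10,240 entries, which is in line with the "10K-parameter" figure, but the paper's exact layout may differ.

```python
import numpy as np

HIDDEN_DIM = 1024   # assumption: hidden size of the compact coordinator model
NUM_LLMS = 7        # agents A0..A6 listed in the Figure 15 caption
ROLES = ("Thinker", "Worker", "Verifier")


def init_head(rng: np.random.Generator) -> np.ndarray:
    """Create a bias-free linear head: one row of weights per output logit."""
    return rng.normal(0.0, 0.02, size=(NUM_LLMS + len(ROLES), HIDDEN_DIM))


def head_decide(W: np.ndarray, h: np.ndarray) -> tuple[int, str]:
    """Map a hidden state to (LLM index, role) via two groups of logits."""
    logits = W @ h
    llm_logits = logits[:NUM_LLMS]    # first group: which LLM to call
    role_logits = logits[NUM_LLMS:]   # second group: which role to assign
    # Greedy selection is an assumption; sampling from a softmax is also plausible.
    return int(np.argmax(llm_logits)), ROLES[int(np.argmax(role_logits))]


rng = np.random.default_rng(0)
W = init_head(rng)
h = rng.standard_normal(HIDDEN_DIM)   # stand-in for a penultimate-token hidden state
print(W.size, head_decide(W, h))
```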

If this is right

Trinity outperforms single foundation models and prior combination methods on coding, math, reasoning, and domain-knowledge tasks.
The system generalizes robustly to out-of-distribution queries.
Hidden-state representations inside the coordinator deliver richer contextualization than direct prompting.
Separable Covariance Matrix Adaptation Evolution Strategy outperforms reinforcement learning, imitation learning, and random search when dimensionality is high and evaluation budget is limited.
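
To make the optimizer claim more concrete, the sketch below shows a simplified diagonal-covariance evolution strategy in Python. It is written in the spirit of sep-CMA-ES but is not that algorithm: it omits evolution paths and cumulative step-size adaptation, and the function name `diagonal_es`, the learning rate `c_var`, and the toy objective are all hypothetical. The point it illustrates is that only a per-coordinate variance vector is adapted, so memory and update cost grow linearly with the number of parameters, and that the search needs only scalar scores rather than gradients.

```python
import numpy as np


def diagonal_es(objective, x0, sigma0=0.5, popsize=16, iters=60, seed=0):
    """Maximize `objective` with a simplified diagonal-covariance ES."""
    rng = np.random.default_rng(seed)
    n = x0.size
    mean = x0.astype(float).copy()
    var = np.full(n, sigma0 ** 2)      # diagonal of the covariance only: O(n) state
    mu = popsize // 2                  # number of selected (elite) candidates
    # Log-rank recombination weights, largest weight for the best candidate.
    weights = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    weights /= weights.sum()
    c_var = 0.2                        # hypothetical learning rate for the variances

    for _ in range(iters):
        steps = rng.standard_normal((popsize, n)) * np.sqrt(var)
        candidates = mean + steps
        scores = np.array([objective(c) for c in candidates])
        elite = np.argsort(scores)[::-1][:mu]   # best first (maximization)
        sel = steps[elite]
        mean = mean + weights @ sel             # weighted recombination of elite steps
        # Per-coordinate variance update from the selected steps (diagonal only).
        var = (1.0 - c_var) * var + c_var * (weights @ sel ** 2)
    return mean


# Toy separable objective: negative squared distance to a target vector.
target = np.full(8, 3.0)
toy_objective = lambda x: -float(np.sum((x - target) ** 2))

x0 = np.zeros(8)
best = diagonal_es(toy_objective, x0)
print(toy_objective(x0), toy_objective(best))
```

Because the objective is only queried for scalar scores, the same loop would work with a binary task reward in place of the toy quadratic.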

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Existing closed-API models can be reused as-is without weight access or retraining.
The multi-turn role delegation pattern could extend to other agentic workflows beyond the tested benchmarks.
If block-epsilon-separability holds more broadly, similar evolutionary coordinators might apply to non-LLM model ensembles.

Load-bearing premise

The coordinator's hidden-state representations supply sufficiently rich contextualization and the evolutionary strategy exploits block-epsilon-separability to outperform reinforcement learning, imitation learning, and random search under the given high-dimensionality and budget constraints.

What would settle it

Replace the evolved coordinator with either random role assignment or a standard reinforcement-learning optimizer and measure whether accuracy on LiveCodeBench, math, and reasoning suites falls below the reported Trinity scores.

Figures

Figures reproduced from arXiv: 2512.04695 by Edoardo Cetin, Jinglue Xu, Peter Schwendeman, Qi Sun, Stefan Nielsen, Yujin Tang.

**Figure 1.** Overview and an example of our coordination method. Left: The cyclical coordination architecture. In each turn, the full conversation transcript is passed to a compact coordinator model. A lightweight head selects an LLM and assigns it one of three roles: Thinker (T), Worker (W), or Verifier (V). A message processing module injects a role-specific prompt before the request is sent to the chosen LLM. Right: …

**Figure 2.** Parametrization of the TRINITY coordinator. A lightweight head (see the paper's appendix) operates in parallel to the base model's LM head. It takes the hidden state h corresponding to the penultimate output token as its sole input. This head fθ is responsible for coordination decisions, producing two sets of logits, one to select an LLM from the pool of L models, and another to assign one of three roles. We also f…

**Figure 3.** TRINITY outperforms single- and multi-model baselines across four benchmarks. Our approach (boldface on the x-axis) achieves the highest performance across four tasks, surpassing the baseline methods. In Math500, MMLU and LiveCodeBench, our performance is close to “PerQuestion-Best”, representing an upper bound achieved by taking the union of all correct answers from the single LLMs.

**Figure 4.** LiveCodeBench results. Top: TRINITY achieves state-of-the-art. Bottom: TRINITY benefits from increasing maximum turn budgets. This suggests that TRINITY does more than simply select the best agent for a task. To assess TRINITY’s generalization capability, we tested its zero-shot performance on four held-out benchmarks.

**Figure 5.** Task type separability in extracted hidden states. Both are based on penultimate-token hidden states processed by the SLM on the input sequence, and the labels are from the task metadata. Appendix A.3 reports the full results.

**Figure 6.** LLM selection distribution evolves as the coordinator learning progresses. Left: Distribution evolution from sep-CMA-ES. Right: Distribution evolution from REINFORCE. … weights of parent models are merged into a child model, and macro-level fusion in data-flow space, where activations or outputs are passed across fixed models or model components. Micro-level. Early approaches in micro-level utilize on static…

**Figure 7.** Agent selection distribution by task. Percentage of datapoints where each agent was selected by the trained coordinator. Appendix A.2.2 (Cost in label generation): The cost profiles of SFT and label-free training methods, such as sep-CMA-ES, REINFORCE, and RS, differ substantially. For SFT, the dominant cost lies in label generation. Labels can be produced at reasonable cost for a direct mapping from representation spa…

**Figure 8.** PCA analysis. All four plots demonstrate clear clustering patterns.

**Figure 9.** LDA analysis. The Fisher’s ratios indicate that the between-class scatter is approximately two to three times greater than the within-class scatter. [Plot residue from the UMAP figure: panels “Agent Selection - from SLM”, “Agent Selection - from Head”, “Task Type - from SLM”, “Task…”; axes UMAP 1 and UMAP 2.]

**Figure 10.** UMAP analysis. The clustering patterns indicate strong non-linear separability.

**Figure 11.** t-SNE analysis. The analysis demonstrates particularly strong separability of task types in the hidden states extracted from the SLM. [Plot residue from the SVM figure: panel “Linear SVM - Task Type”; axes Principal Component 1 and Principal Component 2; annotations Accuracy: 1.000 ± 0.000, Random: 0.250, Samples: 1698.]

**Figure 12.** SVM analysis on hidden states extracted from the SLM. Classification accuracies: Linear SVM (task type) = 1.000, RBF SVM (task type) = 1.000, Linear SVM (agent selection) = 0.713, RBF SVM (agent selection) = 0.776.

**Figure 13.** SVM analysis on output logits. Classification accuracies: Linear SVM (task type) = 0.945, RBF SVM (task type) = 0.955, Linear SVM (agent selection) = 0.786, RBF SVM (agent selection) = 0.783. From Figures 8–11, both linear (PCA/LDA) and non-linear (UMAP/t-SNE) views reveal clear structure. LDA’s reported Fisher ratios (between/within scatter ≈2–3×) corroborate that much of the variance aligns with task-d…

**Figure 14.** Separability index vs head classification accuracy. Trained on synthetic datasets with systematically varied separability, the linear head exhibits a strong positive correlation between separability index and test classification accuracy. We train the exact same head used in our experiments, linear, on these synthetic datasets…

**Figure 15.** Agent distribution over tasks. A0: GPT-5, A1: Claude-Sonnet-4-20250514, A2: Gemini-2.5-pro, A3: DeepSeek-R1-Distill-Qwen-32B, A4: Gemma-3-27b-It, A5: Qwen3-32B (reasoning), A6: Qwen/Qwen3-32B (direct). TRINITY demonstrates a strong task-aware agent selection strategy. Appendix A.7.3 (Additional baseline results), Parallel Sampling: We report additional baselines using majority voting over 5 samples per question. Whi…
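
The “PerQuestion-Best” upper bound mentioned in the Figure 3 caption counts a question as solved if at least one single LLM answers it correctly. A minimal Python sketch of that computation follows; the function name `per_question_best` and the correctness data are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def per_question_best(correct_by_model: Dict[str, List[bool]]) -> float:
    """Fraction of questions answered correctly by at least one model."""
    per_model = list(correct_by_model.values())
    num_questions = len(per_model[0])
    solved = [any(flags[i] for flags in per_model) for i in range(num_questions)]
    return sum(solved) / num_questions


# Made-up correctness flags for two hypothetical models on four questions.
toy_results = {
    "model_a": [True, False, False, True],
    "model_b": [False, True, False, True],
}

single_best = max(sum(flags) / len(flags) for flags in toy_results.values())
print(single_best, per_question_best(toy_results))  # 0.5 0.75
```

The gap between the best single model and this union is the headroom a coordinator can hope to capture by routing alone; exceeding it would require the models to solve together what none solves individually.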

Original abstract

Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately 0.6B parameters) and a lightweight head (approximately 10K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Trinity, a lightweight coordinator (~0.6B-parameter LM plus ~10K-parameter head) that orchestrates multiple LLMs over multiple turns by dynamically assigning one of three roles (Thinker, Worker, Verifier) to selected models. The coordinator is optimized via a separable Covariance Matrix Adaptation Evolution Strategy (CMA-ES). The paper reports consistent outperformance over individual models and prior methods on coding, math, reasoning, and domain-knowledge benchmarks, with a claimed SOTA score of 86.2% on LiveCodeBench and robust generalization to out-of-distribution tasks. Performance is attributed to two factors: rich contextualization provided by the coordinator’s hidden-state representations and advantages of separable CMA-ES over RL, imitation learning, and random search under high dimensionality and budget constraints via exploitation of block-epsilon-separability.

Significance. If the reported gains are reproducible and the two explanatory factors are isolated by proper controls, the work offers a practical route to combining heterogeneous LLMs without weight merging or access to internal parameters, which is especially relevant for closed APIs. The multi-turn role-assignment mechanism and the use of an evolutionary optimizer in place of RL constitute concrete engineering contributions that could be adopted more broadly. The explicit identification of representation richness and optimizer choice as load-bearing factors is a strength that invites targeted follow-up.

major comments (2)

[Abstract and §5 (Experimental Results)] The central claim of SOTA performance (e.g., 86.2% on LiveCodeBench) and its attribution to the two factors rest on summarized numbers without reported error bars, standard deviations across runs, or ablation tables that isolate the contribution of hidden-state contextualization versus the separable CMA-ES. Without these controls the performance advantage cannot be confidently linked to the stated mechanisms rather than to unaccounted differences in total inference budget or prompt engineering.
[§4 (Theoretical Analysis)] The assertion that separable CMA-ES exploits block-epsilon-separability to outperform RL, imitation learning, and random search under the given dimensionality and budget constraints is presented as a key explanatory factor, yet the manuscript supplies neither a formal definition of block-epsilon-separability nor a derivation showing how the separability property yields the observed sample-efficiency gains. A concrete inequality or convergence argument tied to the empirical optimizer comparison is required.

minor comments (2)

[§3] Clarify the precise architecture of the lightweight head (approximately 10K parameters) and how its output interfaces with the role-assignment and model-selection decisions; a small diagram or pseudocode would remove ambiguity.
[Table 2] Ensure that all baseline comparisons in the tables explicitly state whether the same total number of LLM calls or wall-clock budget was used; otherwise the reported gains may partly reflect differences in inference cost rather than algorithmic superiority.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional controls, reporting, and formalization where appropriate.

Referee: [Abstract and §5 (Experimental Results)] The central claim of SOTA performance (e.g., 86.2% on LiveCodeBench) and its attribution to the two factors rest on summarized numbers without reported error bars, standard deviations across runs, or ablation tables that isolate the contribution of hidden-state contextualization versus the separable CMA-ES. Without these controls the performance advantage cannot be confidently linked to the stated mechanisms rather than to unaccounted differences in total inference budget or prompt engineering.

Authors: We agree that statistical reporting and isolating ablations are necessary to support the attribution. In the revised manuscript we now report means with standard deviations over five independent runs for all main results, including LiveCodeBench. We have added ablation tables in §5 that compare the full system against (i) a prompt-only coordinator without hidden-state contextualization and (ii) random search in place of separable CMA-ES, while holding total inference budget and prompt templates fixed. These controls show that both factors contribute measurably to the gains; the combined system remains highest. We believe the revisions now allow confident linkage of performance to the stated mechanisms. Revision: yes.

Referee: [§4 (Theoretical Analysis)] The assertion that separable CMA-ES exploits block-epsilon-separability to outperform RL, imitation learning, and random search under the given dimensionality and budget constraints is presented as a key explanatory factor, yet the manuscript supplies neither a formal definition of block-epsilon-separability nor a derivation showing how the separability property yields the observed sample-efficiency gains. A concrete inequality or convergence argument tied to the empirical optimizer comparison is required.

Authors: We acknowledge the request for greater formality. The revised §4 now supplies an explicit definition: an objective is block-ε-separable if it admits a partition of parameters into blocks such that the function value differs from the sum of per-block functions by at most ε. We include a short derivation showing that, under this property, independent covariance adaptation per block reduces effective dimensionality and yields a sample-complexity scaling of O(B log(1/δ)/ε²) (B = number of blocks) versus the higher variance incurred by RL or random search in the full space. The argument is tied directly to the empirical optimizer curves by noting that the observed faster convergence of separable CMA-ES is consistent with the reduced variance predicted by the bound when role-assignment objectives exhibit approximate block separability. Revision: yes.
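
The verbal definition in the response above can be written out as follows. The notation is chosen here for illustration and may not match the revised manuscript: $J$ is the objective, $\theta \in \mathbb{R}^n$ the parameter vector, $S_1, \dots, S_B$ the coordinate blocks, and $J_b$ the per-block functions. The second line only restates the scaling claimed in the response; it is not derived here.

```latex
% Block-epsilon-separability: J is within epsilon of a sum of per-block terms.
\[
  \Bigl|\, J(\theta) - \sum_{b=1}^{B} J_b\bigl(\theta_{S_b}\bigr) \Bigr| \le \varepsilon
  \quad \text{for all } \theta \in \mathbb{R}^n,
  \qquad \{S_1, \dots, S_B\} \text{ a partition of } \{1, \dots, n\}.
\]
% Sample-complexity scaling as claimed in the rebuttal (B = number of blocks).
\[
  N(\varepsilon, \delta) = O\!\left( \frac{B \,\log(1/\delta)}{\varepsilon^{2}} \right).
\]
```

With $\varepsilon = 0$ this reduces to exact additive separability across blocks, the case where adapting each block independently loses nothing.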

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports empirical benchmark results (e.g., 86.2% on LiveCodeBench) and identifies two explanatory factors via theoretical and empirical analyses: the coordinator's hidden-state representations for contextualization and the separable CMA-ES optimizer's advantages over RL/imitation/random search under high dimensionality and budget constraints by exploiting block-epsilon-separability. No equations, derivations, or self-citations are shown that reduce any prediction or first-principles result to fitted inputs by construction. The evolutionary strategy is presented as a standard named variant applied to a lightweight coordinator, with performance claims grounded in direct experimental outcomes and factor isolation rather than self-referential definitions or load-bearing self-citation chains. The derivation chain remains self-contained through reported results and analyses without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only. The approach assumes standard LLM forward passes produce usable hidden states for contextualization and that evolutionary search can locate effective delegation policies under the given constraints; no explicit new axioms or entities are introduced beyond the three roles.

pith-pipeline@v0.9.0 · 5535 in / 1134 out tokens · 42211 ms · 2026-05-17T01:14:51.929864+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms
cs.LG 2026-05 unverdicted novelty 6.0

GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new num...
