arxiv: 2605.14212 · v1 · pith:PPVKZDEHnew · submitted 2026-05-14 · 💻 cs.AI

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang


Pith reviewed 2026-05-15 02:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords automatic multi-agent systems · end-to-end reinforcement learning · agent design · co-evolution · credit assignment · self-executing agents · meta-learning


The pith

MetaAgent-X jointly trains the designer and executors of automatic multi-agent systems using end-to-end reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaAgent-X to overcome the limitation in current automatic multi-agent systems where only the meta-designer is trained while execution agents remain frozen. It proposes an end-to-end reinforcement learning approach that optimizes both the design of agent workflows and their execution together. By using script-based generation and collecting rollouts that cover both designer and executor actions, the framework assigns credit to improve both parts. This is supported by Executor Designer Hierarchical Rollout and Stagewise Co-evolution techniques for stable training. The result is better performance than prior methods, showing that fully trainable self-designing and self-executing agentic models are achievable.

Core claim

MetaAgent-X is an end-to-end reinforcement learning framework for automatic multi-agent systems that enables joint optimization of MAS design and execution by generating scripts, collecting hierarchical rollouts, and performing credit assignment across designer and executor trajectories. Using Executor Designer Hierarchical Rollout and Stagewise Co-evolution, it achieves stable optimization and up to 21.7% performance gains over existing baselines, with both components improving during training.
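A minimal sketch of how a hierarchical rollout with credit assignment could be organized. It assumes, which the summary above does not state, that each designer-generated script is executed several times and that the designer's reward is the mean task reward of its executor rollouts. All names (`DesignerSample`, `hierarchical_rollout`, `design_fn`, `execute_fn`, `centered`) are hypothetical, and the group-mean baseline is a stand-in rather than the paper's estimator.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List


@dataclass
class DesignerSample:
    script: str  # generated MAS script (hypothetical representation)
    executor_rewards: List[float] = field(default_factory=list)

    @property
    def reward(self) -> float:
        # Assumption: designer reward = mean reward of its executor rollouts.
        return mean(self.executor_rewards)


def hierarchical_rollout(
    task: str,
    design_fn: Callable[[str, int], str],
    execute_fn: Callable[[str, int], float],
    n_designs: int,
    n_execs: int,
) -> List[DesignerSample]:
    """Sample n_designs scripts, then run each one n_execs times."""
    samples = []
    for d in range(n_designs):
        sample = DesignerSample(script=design_fn(task, d))
        for e in range(n_execs):
            sample.executor_rewards.append(execute_fn(sample.script, e))
        samples.append(sample)
    return samples


def centered(values: List[float]) -> List[float]:
    """Subtract the group mean, a simple baseline for advantage estimates."""
    baseline = mean(values)
    return [v - baseline for v in values]


# Toy stand-ins for the designer and executor policies.
def toy_design(task: str, d: int) -> str:
    return f"script-{d}"


def toy_execute(script: str, e: int) -> float:
    return 1.0 if (script == "script-1" or e == 0) else 0.0


samples = hierarchical_rollout("toy task", toy_design, toy_execute,
                               n_designs=2, n_execs=2)
designer_adv = centered([s.reward for s in samples])            # across designs
executor_adv = [centered(s.executor_rewards) for s in samples]  # within a design
```

In this toy setup the designer signal compares scripts against each other, while each executor signal compares rollouts of the same script, so the two levels receive separate credit from one shared batch.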

What carries the argument

Executor Designer Hierarchical Rollout combined with Stagewise Co-evolution, which structures the training process to expose co-evolution dynamics and provide stable joint optimization of designer and executor policies.
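One way to picture a stagewise schedule is as an alternation in which only one role's trajectories update the policy during each stage. The sketch below assumes fixed-length stages and an executor-first order; both are illustrative choices, and the names `stagewise_schedule` and `trajectories_to_train` are hypothetical rather than taken from the paper.

```python
from typing import Iterator, Tuple


def stagewise_schedule(total_steps: int, stage_len: int,
                       first: str = "executor") -> Iterator[Tuple[int, str]]:
    """Yield (step, role_to_update), switching role every stage_len steps."""
    if stage_len < 1:
        raise ValueError("stage_len must be at least 1")
    if first == "executor":
        roles = ("executor", "designer")
    else:
        roles = ("designer", "executor")
    for step in range(total_steps):
        stage = step // stage_len
        yield step, roles[stage % 2]


def trajectories_to_train(role: str, designer_trajs: list,
                          executor_trajs: list) -> list:
    """Keep only the active role's trajectories for this update.

    A coupled variant would instead return both lists together.
    """
    return designer_trajs if role == "designer" else executor_trajs


schedule = [role for _, role in stagewise_schedule(total_steps=6, stage_len=2)]
```

With `stage_len=1` the same function yields strict one-step alternation, which makes it easy to compare stage lengths inside a single training loop.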

Load-bearing premise

The hierarchical rollout structure and stagewise training provide stable joint optimization with accurate credit assignment between designer and executor without introducing biases or instabilities.

What would settle it

A training run in which overall performance fails to exceed baselines or in which one component's improvement comes at the direct expense of the other would indicate the joint optimization is not working as claimed.

Figures

Figures reproduced from arXiv: 2605.14212 by Huazheng Wang, Jiayu Chang, Jishen Zhao, Nan Wang, Qingyun Wu, Yaolun Zhang, Yiran Wu, Yizhao Chen, Yujie Zhao.

**Figure 1.** From Partial Adaptation to End-to-End Trainable Automatic MAS. A. Comparison of three automatic MAS paradigms. B. Overview of our training framework. Meanwhile, as agentic reinforcement learning and self-evolving paradigms have emerged as promising pathways to transform large language models into interactive, continuously improving decision-makers [Wang et al., 2025c, Cheng et al., 2025, Li et al., 2025b, …

**Figure 2.** Overview of the end-to-end online MetaAgent-X pipeline. The Designer first generates a …

**Figure 3.** Training-reward dynamics ablations of the proposed stagewise co-evolution. Does Stagewise Co-evolution Help? We compare the proposed schedule on Qwen3-8B with three variants: coupled training, executor-only training, and designer-only training. In the coupled setting, trajectories from both roles update the shared policy simultaneously. As shown in …

**Figure 4.** Sensitivity analysis on the stage length for designer–executor alternation. One-step alternation is unstable and collapses during training, while longer stages provide more …

Original abstract

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MetaAgent-X shows joint end-to-end RL for both designer and executor in automatic MAS, with reported gains and ablations, but the experimental details are too thin to fully trust the credit assignment story yet.


The main advance is training the meta-designer and the execution agents together in one RL loop instead of freezing the executors while only tuning the designer. They add a hierarchical rollout to collect trajectories at both levels and a stagewise co-evolution schedule to keep the joint updates from blowing up. The abstract says this produces up to 21.7% better results than prior automatic MAS baselines and that ablations show both components improving over time, which at least suggests the co-evolution process is doing something observable rather than just designer drift.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaAgent-X, an end-to-end reinforcement learning framework for automatic multi-agent systems that jointly optimizes script-based MAS design and execution via Executor Designer Hierarchical Rollout and Stagewise Co-evolution. It reports consistent outperformance of existing automatic MAS baselines with gains up to 21.7%, supported by ablations showing joint improvement of designer and executor components throughout training.

Significance. If the empirical results and stability claims hold under rigorous verification, the work would establish end-to-end trainable automatic MAS as a viable paradigm, overcoming the frozen-executor ceiling of prior methods and providing evidence for practical self-designing agentic models.

major comments (2)

Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.
Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.
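As background for the second major comment, the standard policy-gradient identity below shows why subtracting a baseline affects the variance of the gradient estimate but not its expectation, provided the baseline does not depend on the sampled action. The symbols ($\pi_\theta$, $s$, $a$, $b(s)$, $R$) are generic notation chosen for illustration and are not the paper's; whether the paper's designer baseline meets the action-independence condition cannot be judged from the abstract.

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0 .
\end{align*}
% Consequently, for any return R:
\[
\mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(R - b(s)\bigr) \right]
  = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, R \right].
\]
```

In this generic setting, noise in $R$ that originates downstream of the action shows up as variance in the estimate, which is what a well-chosen $b(s)$ is meant to reduce.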

minor comments (2)

Abstract: the phrase 'which creating a frozen-executor ceiling' contains a grammatical error that should be corrected for clarity.
Notation: the distinction between 'designer trajectories' and 'executor trajectories' in the hierarchical rollout is introduced without a formal definition or diagram, making the credit propagation mechanism harder to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and have revised the manuscript to provide the requested details and clarifications.


Referee: Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.

Authors: We agree that the initial submission lacked sufficient experimental details. In the revised version we have added a dedicated Experimental Setup section that fully specifies the environments and task distributions, lists all baselines with implementation references, describes the evaluation protocol, reports statistical significance via paired t-tests (including p-values), and presents results as mean ± standard deviation over five independent runs with different random seeds. These additions directly close the verification gaps for the reported performance numbers.
Revision: yes

Referee: Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.

Authors: We thank the referee for highlighting this omission. The hierarchical rollout already applies a learned value baseline for advantage estimation on designer actions precisely to reduce variance and mitigate correlation with executor noise. This was not stated explicitly in the original text. We have now inserted the exact baseline formulation, the advantage estimator, and a short proof sketch showing that the resulting signals remain unbiased under the on-policy sampling used by Stagewise Co-evolution. Importance sampling corrections are unnecessary because all rollouts are on-policy.
Revision: yes
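The first response above commits to mean ± standard deviation over five seeds and paired t-tests. A minimal sketch of that kind of reporting, using only the Python standard library, might look as follows. The function names are hypothetical and the numbers are placeholders chosen for illustration; they are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(scores), stdev(scores)


def paired_t_statistic(method: Sequence[float],
                       baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Placeholder accuracies for five seeds; NOT results from the paper.
method_runs = [0.62, 0.60, 0.65, 0.61, 0.63]
baseline_runs = [0.55, 0.56, 0.57, 0.54, 0.58]

method_mean, method_sd = summarize(method_runs)
t_stat, dof = paired_t_statistic(method_runs, baseline_runs)
print(f"method: {method_mean:.3f} ± {method_sd:.3f}, t = {t_stat:.2f}, dof = {dof}")
```

Turning the statistic into a p-value needs the t distribution's CDF, which the standard library does not provide; `scipy.stats.ttest_rel` is one existing option. Pairing is only valid when each method run and baseline run share the same seed and task split.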

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper introduces an empirical RL framework for joint optimization of MAS designer and executor via hierarchical rollouts and stagewise co-evolution. No load-bearing steps reduce by the paper's own equations or self-citations to fitted inputs by construction; reported performance gains (up to 21.7%) and ablations are presented as experimental outcomes from training trajectories, with credit assignment described as part of the RL setup rather than a self-referential definition. The framework remains self-contained against external benchmarks without renaming known results or smuggling ansatzes via prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard reinforcement learning assumptions plus the effectiveness of the two proposed training techniques; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5541 in / 1046 out tokens · 35316 ms · 2026-05-15T02:45:37.586097+00:00 · methodology

