MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning
Pith reviewed 2026-05-15 02:45 UTC · model grok-4.3
The pith
MetaAgent-X jointly trains the designer and executors of automatic multi-agent systems using end-to-end reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MetaAgent-X is an end-to-end reinforcement learning framework for automatic multi-agent systems that enables joint optimization of MAS design and execution by generating scripts, collecting hierarchical rollouts, and performing credit assignment across designer and executor trajectories. Using Executor Designer Hierarchical Rollout and Stagewise Co-evolution, it achieves stable optimization and up to 21.7% performance gains over existing baselines, with both components improving during training.
What carries the argument
Executor Designer Hierarchical Rollout combined with Stagewise Co-evolution, which structures the training process to expose co-evolution dynamics and provide stable joint optimization of designer and executor policies.
Load-bearing premise
The hierarchical rollout structure and stagewise training provide stable joint optimization with accurate credit assignment between designer and executor without introducing biases or instabilities.
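One way to picture the stagewise part of this premise is a fixed-length alternation schedule. The sketch below is an assumption for illustration only: the page's Figure 4 caption mentions a "stage length for designer–executor alternation", but the function name `active_role`, the choice to start with the designer, and the idea that exactly one policy is updated per step are hypothetical and not taken from the paper.

```python
def active_role(step: int, stage_len: int) -> str:
    """Return which policy is updated at `step` under fixed-length alternation.

    Hypothetical schedule: stages of `stage_len` steps, designer first.
    """
    if stage_len < 1:
        raise ValueError("stage_len must be >= 1")
    return "designer" if (step // stage_len) % 2 == 0 else "executor"


# One-step alternation switches roles on every update.
print([active_role(t, stage_len=1) for t in range(4)])
# ['designer', 'executor', 'designer', 'executor']

# Longer stages keep one role active for several consecutive updates.
print([active_role(t, stage_len=3) for t in range(8)])
# ['designer', 'designer', 'designer', 'executor', 'executor', 'executor', 'designer', 'designer']
```

With `stage_len=1` the roles flip on every step, which corresponds to the one-step setting that the caption describes as unstable.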
What would settle it
A training run in which overall performance fails to exceed baselines or in which one component's improvement comes at the direct expense of the other would indicate the joint optimization is not working as claimed.
Original abstract
Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MetaAgent-X, an end-to-end reinforcement learning framework for automatic multi-agent systems that jointly optimizes script-based MAS design and execution via Executor Designer Hierarchical Rollout and Stagewise Co-evolution. It reports consistent outperformance of existing automatic MAS baselines with gains up to 21.7%, supported by ablations showing joint improvement of designer and executor components throughout training.
Significance. If the empirical results and stability claims hold under rigorous verification, the work would establish end-to-end trainable automatic MAS as a viable paradigm, overcoming the frozen-executor ceiling of prior methods and providing evidence for practical self-designing agentic models.
major comments (2)
- Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.
- Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.
minor comments (2)
- Abstract: the phrase 'which creating a frozen-executor ceiling' contains a grammatical error that should be corrected for clarity.
- Notation: the distinction between 'designer trajectories' and 'executor trajectories' in the hierarchical rollout is introduced without a formal definition or diagram, making the credit propagation mechanism harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point-by-point below and have revised the manuscript to provide the requested details and clarifications.
-
Referee: Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.
Authors: We agree that the initial submission lacked sufficient experimental details. In the revised version we have added a dedicated Experimental Setup section that fully specifies the environments and task distributions, lists all baselines with implementation references, describes the evaluation protocol, reports statistical significance via paired t-tests (including p-values), and presents results as mean ± standard deviation over five independent runs with different random seeds. These additions directly close the verification gaps for the reported performance numbers.
revision: yes
-
Referee: Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.
Authors: We thank the referee for highlighting this omission. The hierarchical rollout already applies a learned value baseline for advantage estimation on designer actions precisely to reduce variance and mitigate correlation with executor noise. This was not stated explicitly in the original text. We have now inserted the exact baseline formulation, the advantage estimator, and a short proof sketch showing that the resulting signals remain unbiased under the on-policy sampling used by Stagewise Co-evolution. Importance sampling corrections are unnecessary because all rollouts are on-policy.
revision: yes
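To make the credit-assignment exchange above concrete, here is a minimal sketch of baseline-subtracted advantages over a hierarchical rollout. Everything in it is an assumption for illustration: the data layout (one designer script with several executor runs, each with a scalar reward), the use of the mean executor reward as the designer's return, and the names `Rollout`, `designer_advantage`, and `executor_advantages`. The rebuttal mentions a learned value baseline; here the baseline is simply a number passed in, and the paper's actual estimator is not shown on this page.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Rollout:
    """One designer script plus the rewards of the executor runs it spawned."""
    script_id: str
    executor_rewards: List[float]


def designer_advantage(rollout: Rollout, value_baseline: float) -> float:
    """Designer return (mean executor reward) minus a baseline prediction."""
    return mean(rollout.executor_rewards) - value_baseline


def executor_advantages(rollout: Rollout) -> List[float]:
    """Each executor reward minus the mean reward within the same script."""
    group_mean = mean(rollout.executor_rewards)
    return [r - group_mean for r in rollout.executor_rewards]


# Illustrative numbers only; they are not results from the paper.
rollout = Rollout(script_id="demo", executor_rewards=[1.0, 0.0, 1.0, 1.0])
print(designer_advantage(rollout, value_baseline=0.5))  # 0.25
print(executor_advantages(rollout))  # [0.25, -0.75, 0.25, 0.25]
```

Centering executor rewards on the mean within one script is a common variance-reduction heuristic; whether MetaAgent-X uses it, or how its learned baseline is trained, is not stated in the text above.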
Circularity Check
No circularity in derivation chain
The paper introduces an empirical RL framework for joint optimization of MAS designer and executor via hierarchical rollouts and stagewise co-evolution. No load-bearing steps reduce by the paper's own equations or self-citations to fitted inputs by construction; reported performance gains (up to 21.7%) and ablations are presented as experimental outcomes from training trajectories, with credit assignment described as part of the RL setup rather than a self-referential definition. The framework remains self-contained against external benchmarks without renaming known results or smuggling ansatzes via prior self-citations.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2603.22918.
Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, and Jishen Zhao. Mage: A multi-agent engine for automated rtl code generation, 2024. URL https://arxiv.org/abs/2412.07822.
Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger-mas: Multi-agent reinforcement learning for collab...
-
[2]
Solution formatting. The final agent must produce its answer within a standardized output format, ensuring that the solution is reliably parseable for automated evaluation.
Figure 4: Sensitivity analysis on the stage length for designer–executor alternation. One-step alternation is unstable and collapses during training, while longer stages provide more...
-
[3]
Delivery formatting. Inter-agent messages must be strictly enclosed within <delivery>...</delivery> tags. This constraint serves a dual purpose: it establishes a structured, easily parsable communication protocol, and crucially, it incentivizes agents to distill relevant information into concise deliverables rather than forwarding their entire reasoning tra...
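The delivery constraint above can be illustrated with a small parser. This is a sketch under stated assumptions, not the paper's implementation: only the `<delivery>` tag name comes from the text, while the helper name `extract_delivery` and the parsing policy (first match, surrounding whitespace stripped, `None` when the tags are absent) are hypothetical.

```python
import re
from typing import Optional

# Non-greedy match so that only the first tagged span is captured;
# DOTALL lets the deliverable span several lines.
_DELIVERY_RE = re.compile(r"<delivery>(.*?)</delivery>", re.DOTALL)


def extract_delivery(message: str) -> Optional[str]:
    """Return the text inside the first <delivery>...</delivery> pair, or None."""
    match = _DELIVERY_RE.search(message)
    return match.group(1).strip() if match else None


raw = "Long reasoning trace...\n<delivery>\nanswer = 42\n</delivery>\nscratch notes"
print(extract_delivery(raw))             # answer = 42
print(extract_delivery("no tags here"))  # None
```

Returning `None` for untagged messages lets a caller reject or penalize outputs that ignore the protocol, instead of silently forwarding the whole reasoning trace.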
-
[4]
Compute total number of ways to choose 4 numbers from 10: C(10,4)
-
[5]
Compute P(grand prize): number of ways to match all 4 numbers
-
[6]
Compute P(prize): number of ways to match exactly 2 or exactly 3 or exactly 4 numbers
-
[7]
Use conditional probability formula: P(grand prize | prize) = P(grand prize) / P(prize)
-
[8]
Simplify to lowest terms and find m+n
Solution: Let’s compute step by step:
-
[9]
1. Total number of ways to choose 4 numbers from 10: C(10,4) = 10!/(4!·6!) = (10×9×8×7)/(4×3×2×1) = 210
2. P(grand prize): All 4 numbers match
• There’s only 1 way to choose exactly Jen’s 4 numbers
• So P(grand prize) = 1/210
3. P(prize): At least 2 numbers match
• This includes: exactly 2 matches, exactly 3 matches, exactly 4 matches
• Exactly 4 matches: 1 way...
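The counting plan above can be checked with exact arithmetic. The sketch below assumes the setup implied by the steps (4 numbers drawn from a pool of 10, a prize for at least 2 matches, the grand prize for all 4); the function name and its parameters are illustrative and do not come from the paper.

```python
from fractions import Fraction
from math import comb


def conditional_grand_prize(pool: int = 10, picks: int = 4, min_matches: int = 2) -> Fraction:
    """P(all picks match | at least `min_matches` match), as an exact fraction."""
    total = comb(pool, picks)
    # Ways to match exactly k of the player's numbers: choose k of the
    # player's picks and the remaining picks - k from the other numbers.
    ways = {
        k: comb(picks, k) * comb(pool - picks, picks - k)
        for k in range(min_matches, picks + 1)
    }
    p_prize = Fraction(sum(ways.values()), total)
    p_grand = Fraction(ways[picks], total)
    return p_grand / p_prize


p = conditional_grand_prize()
print(p)                              # 1/115
print(p.numerator + p.denominator)    # 116
```

Using `Fraction` keeps the result in lowest terms automatically, which is what the final "find m+n" step needs.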
-
[10]
Top row number: 100a + 10b + c
-
[11]
Bottom row number: 100d + 10e + f
-
[12]
Condition: (100a + 10b + c) + (100d + 10e + f) = 999
-
[13]
First column number: 10a + d
-
[14]
Second column number: 10b + e
-
[15]
Third column number: 10c + f
-
[16]
Condition: (10a + d) + (10b + e) + (10c + f) = 99
We need to count integer solutions with 0 ≤ a, b, c, d, e, f ≤ 9. This is a combinatorial counting problem with linear constraints.
Workflow Pattern: math_ensemble_judge
Reason: The problem involves constraints that can be approached algebraically, combinatorially, or by case analysis. Multiple perspectives can help ver...
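Because the search space is only 10^6 digit assignments, the two conditions stated above can also be checked by brute force. This is an independent sanity check written for this page, not part of the paper's workflow; the function name `count_grids` is illustrative.

```python
from itertools import product


def count_grids() -> int:
    """Count digit grids (a b c / d e f) satisfying both stated conditions."""
    count = 0
    for a, b, c, d, e, f in product(range(10), repeat=6):
        rows_ok = (100 * a + 10 * b + c) + (100 * d + 10 * e + f) == 999
        cols_ok = (10 * a + d) + (10 * b + e) + (10 * c + f) == 99
        if rows_ok and cols_ok:
            count += 1
    return count


print(count_grids())  # 45
```

The count agrees with a short algebraic argument: the row condition forces a+d = b+e = c+f = 9 (no carries are possible), and substituting into the column condition gives 9(a+b+c) + 27 = 99, so a+b+c = 8, which has C(10,2) = 45 non-negative solutions.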
-
[17]
Each solver independently solves the problem
-
[18]
The judge receives all three solutions
-
[19]
The judge evaluates correctness and selects one answer.
-
[20]
The judge outputs the final answer in format.
<code>
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import EnsembleNode

# No tools needed - pure reasoning
tool_registry = ToolRegistry()

# Create three math solvers with different approaches
solver1 = AgentNode(
    name="AlgebraicSolver",
    system_prompt=(
        "You are a mathematician who...
-
[21]
We can check if this is automatically satisfied given a+d = 9, b+e = 9, c+f = 9. Let’s verify: from a+d = 9, d = 9−a. From b+e = 9, e = 9−b. From c+f = 9, f = 9−c. Then a+c is just a+c. d+e = (9−a) + (9−b) = 18−a−b. b+f = b + (9−c) = 9+b−c. So 10(a+c) + (18−a−b) + (9+b−c) = 9a + 9c + 27. We want this to equal 99: 9a + 9c = 72, so a+c = 8. Possible (a, c): (0,8),(1,...
-
[22]
Takes the last element x = a_n
-
[23]
Partitions the array into two stable parts: elements ≤ x (left) and elements > x (right)
-
[24]
Concatenates left + right
The process stops when an eversion doesn’t change the array. We need to compute the number of eversions until this happens.
Key observations:
• Each eversion moves the last element to the end of the left partition (all elements ≤ it)
• The array eventually becomes sorted in non-decreasing order
• The process stops when the array is a...
-
[25]
CodeSolver writes initial solution
-
[26]
UnitTestAgent tests it thoroughly
-
[27]
CodeSolver refines based on feedback
-
[28]
UnitTestAgent verifies again
-
[29]
Final solution output
<code>
from workflow import AgentNode, Workflow, ToolRegistry
from workflow.nodes import ReflectionNode
import subprocess

# Set up code interpreter tool
tool_registry = ToolRegistry()

def execute_code(code: str) -> str:
    """Execute Python code and return stdout/stderr."""
    try:
        result = subprocess.run(
            ["python", "-c", code],
            capture_o...
-
[30]
Take the last element x
-
[31]
Partition array into elements ≤ x (left) and > x (right), maintaining original order
-
[32]
Concatenate left + right
We need to find the minimum k such that after k eversions the array stops changing. Let me test with the example.
First example: [2,4,1,5,3]. Last element is 3. Partition: elements ≤ 3 are [2,1,3], elements > 3 are [4,5]. Result: [2,1,3,4,5].
. . . similar walk-through of the second and third examples . . .
Wait, let me think: when we partiti...
-
[33]
Logical Flaw: the response has a critical error in its reasoning. It claims that “if current < M, then M is the maximum element > current”, but this is not true. In the second example, when current = 1, M = 5, but the maximum element > 1 is 4, not 5. So the next current should be 4, not M
-
[34]
Incomplete Algorithm: the response doesn’t provide a complete algorithm. It only describes the problem and the insight but doesn’t give a working solution
-
[35]
Missing Edge Cases: the response doesn’t consider edge cases like duplicate elements, already sorted arrays, or arrays where all elements are equal
-
[36]
No Code Implementation: the response doesn’t provide any code implementation, making it impossible to test or use
-
[37]
Confusion in Terminology: the response uses “max of elements > current” but doesn’t explain how to compute this efficiently.
Let me write a complete solution and test it:
. . . execute_code tool call with a candidate min_eversions function and tests on the three sample inputs . . .
Executor: CodeSolver – final solution
I’ll provide the final solution with a cle...
-
[38]
Find the maximum element M in the array
-
[39]
If the last element is already M, return 0
-
[40]
Otherwise, find the number of steps where each step moves to the last element in the array that is greater than the current element
-
[41]
This can be done efficiently by precomputing the maximum elements in decreasing order from the end of the array.
Solution:
def solve():
    import sys
    input = sys.stdin.read
    data = input().split()
    t = int(data[0])
    idx = 1
    results = []
    for _ in range(t):
        n = int(data[idx]); idx += 1
        a = list(map(int, data[idx:idx + n])); idx += n
        # Find the maximum element
        M =...
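Since the executor's final code is cut off above, the following self-contained sketch shows one way the described idea could be checked: a direct simulation of the eversion operation next to a right-to-left scan that counts how often a strictly larger element appears. The function names are illustrative and this is not the paper's or the executor's code; it assumes a non-empty array.

```python
from typing import List


def evert(a: List[int]) -> List[int]:
    """One eversion: stable partition around the last element."""
    x = a[-1]
    left = [v for v in a if v <= x]
    right = [v for v in a if v > x]
    return left + right


def min_eversions_simulated(a: List[int]) -> int:
    """Count eversions until the array stops changing (direct simulation)."""
    steps = 0
    while True:
        nxt = evert(a)
        if nxt == a:
            return steps
        a, steps = nxt, steps + 1


def min_eversions_scan(a: List[int]) -> int:
    """Scan right to left, counting each element strictly above the running max."""
    current, steps = a[-1], 0
    for v in reversed(a[:-1]):
        if v > current:
            current, steps = v, steps + 1
    return steps


example = [2, 4, 1, 5, 3]  # first example from the walk-through above
print(evert(example))      # [2, 1, 3, 4, 5]
print(min_eversions_simulated(example), min_eversions_scan(example))  # 1 1
```

The scan works because, after an eversion around x, the new last element is the rightmost original element greater than x, so each eversion corresponds to one jump to the next strictly larger value when reading from the right. The scan is linear in the array length, whereas the simulation rebuilds the array at every step.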
-
[42]
R = 34, N = 8: r = 34 * (8 - sin(theta/2)) * cos(theta/2)
-
[43]
R = 1, N = 2024: r = 1 * (2024 - sin(theta/2)) * cos(theta/2)
Since r is the same, we can set them equal:
34 * (8 - sin(theta/2)) * cos(theta/2) = (2024 - sin(theta/2)) * cos(theta/2)
Assuming cos(theta/2) != 0, we can divide:
34 * (8 - sin(theta/2)) = 2024 - sin(theta/2)
272 - 34 sin(theta/2) = 2024 - sin(theta/2)
272 - 2024 = 34 sin(theta/2) - sin(theta...
-
[44]
8 circles of radius 34, each tangent to two sides of the triangle and to each other
-
[45]
2024 circles of radius 1, arranged similarly
- Each sequence’s centers lie on a line parallel to the third side
- The distance between centers of consecutive circles in a sequence is 2*radius
**Approach:**
-
[46]
Let the triangle have sides a, b, c with inradius r
-
[47]
For the sequence of circles tangent to sides AB and BC, the line of centers is parallel to side AC
-
[48]
The distance from the incenter to side AC is r
-
[49]
The distance from the incenter to the line of centers of the 34-radius circles is r - 34 (or r - 1 for the 1-radius circles)
-
[50]
The distance between centers of consecutive circles in a sequence is 2*radius
. . . the first attempt derives equations from the wrong distance relation . . .
But we also know that this length is (n-1) * 2r1. So:
L * (r - d)/r = (n-1) * 2r1
But d = r - r1, so:
L * (r - (r - r1))/r = (n-1) * 2r1
L * r1/r = (n-1) * 2r1
L/r = 2(n-1)
L = 2r(n-1)
So L = 2r(n-1)...
-
[51]
L * (r - 34)/r = 476
-
[52]
L * (r - 1)/r = 4046
Divide equation 2 by equation 1:
(L * (r - 1)/r) / (L * (r - 34)/r) = 4046/476
(r - 1)/(r - 34) = 4046/476
Simplify 4046/476: Both divisible by 2: 2023/238. 238 = 2 * 7 * 17; 2023 = 7 * 17^2. So 2023/238 = 17/2.
So: (r - 1)/(r - 34) = 17/2
2(r - 1) = 17(r - 34)
2r - 2 = 17r - 578
576 = 15r
r = 576/15 = 192/5
So the inradius r = 192/5....
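The elimination above can be reproduced with exact rational arithmetic. The sketch below takes the two equations as given in the text, L·(r − 34)/r = 476 and L·(r − 1)/r = 4046, and solves for r after eliminating L; the function name and parameterization are illustrative, not from the paper.

```python
from fractions import Fraction


def solve_inradius(r1: int, n1: int, r2: int, n2: int) -> Fraction:
    """Solve L*(r - r_i)/r = 2*r_i*(n_i - 1) for i = 1, 2 by eliminating L."""
    s1 = 2 * r1 * (n1 - 1)  # 476 for 8 circles of radius 34
    s2 = 2 * r2 * (n2 - 1)  # 4046 for 2024 circles of radius 1
    # Dividing the equations gives (r - r2)/(r - r1) = s2/s1, i.e.
    # s1*(r - r2) = s2*(r - r1), which is linear in r.
    return Fraction(s2 * r1 - s1 * r2, s2 - s1)


r = solve_inradius(34, 8, 1, 2024)
print(r)  # 192/5
```

`Fraction` reduces 137088/3570 to 192/5 directly, matching the hand simplification through 2023/238 = 17/2.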