pith. machine review for the scientific record.

arxiv: 2605.14212 · v1 · pith:PPVKZDEHnew · submitted 2026-05-14 · 💻 cs.AI

MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Pith reviewed 2026-05-15 02:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords automatic multi-agent systems · end-to-end reinforcement learning · agent design · co-evolution · credit assignment · self-executing agents · meta-learning

The pith

MetaAgent-X jointly trains the designer and executors of automatic multi-agent systems using end-to-end reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MetaAgent-X to overcome the limitation in current automatic multi-agent systems where only the meta-designer is trained while execution agents remain frozen. It proposes an end-to-end reinforcement learning approach that optimizes both the design of agent workflows and their execution together. By using script-based generation and collecting rollouts that cover both designer and executor actions, the framework assigns credit to improve both parts. This is supported by Executor Designer Hierarchical Rollout and Stagewise Co-evolution techniques for stable training. The result is better performance than prior methods, showing that fully trainable self-designing and self-executing agentic models are achievable.

Core claim

MetaAgent-X is an end-to-end reinforcement learning framework for automatic multi-agent systems that enables joint optimization of MAS design and execution by generating scripts, collecting hierarchical rollouts, and performing credit assignment across designer and executor trajectories. Using Executor Designer Hierarchical Rollout and Stagewise Co-evolution, it achieves stable optimization and up to 21.7% performance gains over existing baselines, with both components improving during training.
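
One way to picture credit assignment over hierarchical rollouts is a two-level grouping: several designs per task, and several executions per design. The sketch below is only an illustration under assumptions that are not taken from the paper: the function name `assign_credit`, the use of group means as baselines, and the toy rewards are all hypothetical.

```python
from statistics import mean


def assign_credit(rollouts):
    """Split rewards into designer-level and executor-level advantages.

    rollouts maps a design id to the list of rewards obtained by
    executing that design several times on the same task.
    (Hypothetical structure, assumed for illustration only.)
    """
    # Designer-level reward: average outcome of the executions of each design.
    design_reward = {d: mean(rs) for d, rs in rollouts.items()}
    # Designer baseline: mean over the sibling designs for the same task.
    design_baseline = mean(list(design_reward.values()))
    designer_adv = {d: r - design_baseline for d, r in design_reward.items()}
    # Executor baseline: mean within the same design, so each execution is
    # compared only against other executions of the identical script.
    executor_adv = {
        d: [r - design_reward[d] for r in rs] for d, rs in rollouts.items()
    }
    return designer_adv, executor_adv


# Toy rewards (made up): two designs, two executions each.
rollouts = {"design_a": [1.0, 0.0], "design_b": [1.0, 1.0]}
designer_adv, executor_adv = assign_credit(rollouts)
print(designer_adv)
print(executor_adv)
```

In this toy setup the designer is credited for producing the design whose executions succeed more often, while the executor is credited only for doing better than its own average under a fixed design.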

What carries the argument

Executor Designer Hierarchical Rollout combined with Stagewise Co-evolution, which structures the training process to expose co-evolution dynamics and provide stable joint optimization of designer and executor policies.
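
A stagewise schedule can be sketched as a function that decides which role receives gradient updates at a given training step. The helper `active_role`, the choice of which role trains first, and the stage length of 3 are assumptions for illustration; the summary above only states that designer and executor training alternate in stages.

```python
def active_role(step: int, stage_length: int, first: str = "executor") -> str:
    """Return the role updated at this step under block-wise alternation.

    Both the ordering of roles and the stage length are hypothetical knobs.
    """
    if stage_length < 1:
        raise ValueError("stage_length must be at least 1")
    if first not in ("executor", "designer"):
        raise ValueError("first must be 'executor' or 'designer'")
    roles = ("executor", "designer") if first == "executor" else ("designer", "executor")
    # Integer division groups consecutive steps into stages; parity picks the role.
    return roles[(step // stage_length) % 2]


schedule = [active_role(step, stage_length=3) for step in range(8)]
print(schedule)
```

Setting `stage_length=1` gives step-by-step alternation, which is the kind of setting a stage-length sensitivity analysis would compare against longer stages.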

Load-bearing premise

The hierarchical rollout structure and stagewise training provide stable joint optimization with accurate credit assignment between designer and executor without introducing biases or instabilities.

What would settle it

A training run in which overall performance fails to exceed baselines or in which one component's improvement comes at the direct expense of the other would indicate the joint optimization is not working as claimed.
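
The criterion above can be written as a simple check over logged scores. This is a rough sketch: `joint_optimization_holds` and its inputs are hypothetical, and comparing only the first and last points of each curve is a simplification that a real analysis would replace with something more robust to noise.

```python
def joint_optimization_holds(system_score, baseline_score,
                             designer_curve, executor_curve):
    """Return True when the run beats the baseline and both roles improve.

    designer_curve and executor_curve are per-checkpoint quality scores
    for each role (assumed to be logged separately).
    """
    beats_baseline = system_score > baseline_score
    designer_gain = designer_curve[-1] - designer_curve[0]
    executor_gain = executor_curve[-1] - executor_curve[0]
    # A gain in one role paired with a loss in the other is read as
    # improvement at the other's expense.
    both_improve = designer_gain > 0 and executor_gain > 0
    return beats_baseline and both_improve


# Placeholder numbers, not results from the paper.
print(joint_optimization_holds(0.7, 0.6, [0.2, 0.4], [0.3, 0.5]))
print(joint_optimization_holds(0.7, 0.6, [0.2, 0.4], [0.5, 0.3]))
```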

Figures

Figures reproduced from arXiv: 2605.14212 by Huazheng Wang, Jiayu Chang, Jishen Zhao, Nan Wang, Qingyun Wu, Yaolun Zhang, Yiran Wu, Yizhao Chen, Yujie Zhao.

Figure 1: From Partial Adaptation to End-to-End Trainable Automatic MAS. A. Comparison of three automatic MAS paradigms. B. Overview of our training framework. Meanwhile, as agentic reinforcement learning and self-evolving paradigms have emerged as promising pathways to transform large language models into interactive, continuously improving decision-makers [Wang et al., 2025c, Cheng et al., 2025, Li et al., 2025b, …
Figure 2: Overview of the end-to-end online MetaAgent-X pipeline. The Designer first generates a …
Figure 3: Training-reward dynamics ablations of the proposed stagewise co-evolution. Does Stagewise Co-evolution Help? We compare the proposed schedule on Qwen3-8B with three variants: coupled training, executor-only training, and designer-only training. In the coupled setting, trajectories from both roles update the shared policy simultaneously. As shown in …
Figure 4: Sensitivity analysis on the stage length for designer–executor alternation. One-step alternation is unstable and collapses during training.
Original abstract

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MetaAgent-X, an end-to-end reinforcement learning framework for automatic multi-agent systems that jointly optimizes script-based MAS design and execution via Executor Designer Hierarchical Rollout and Stagewise Co-evolution. It reports consistent outperformance of existing automatic MAS baselines with gains up to 21.7%, supported by ablations showing joint improvement of designer and executor components throughout training.

Significance. If the empirical results and stability claims hold under rigorous verification, the work would establish end-to-end trainable automatic MAS as a viable paradigm, overcoming the frozen-executor ceiling of prior methods and providing evidence for practical self-designing agentic models.

major comments (2)
  1. Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.
  2. Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.
minor comments (2)
  1. Abstract: the phrase 'which creating a frozen-executor ceiling' contains a grammatical error that should be corrected for clarity.
  2. Notation: the distinction between 'designer trajectories' and 'executor trajectories' in the hierarchical rollout is introduced without a formal definition or diagram, making the credit propagation mechanism harder to follow.
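
Major comment 1 asks for variance across runs and significance tests. A minimal sketch of what such a report could compute is shown below, using only the standard library. The scores are made-up placeholders rather than results from the paper, and `paired_t_statistic` is a hypothetical helper; obtaining a p-value additionally requires the Student t distribution with n - 1 degrees of freedom, which `scipy.stats.ttest_rel` provides.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic for two score lists measured on matched seeds."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the per-seed differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Placeholder accuracies (percent) over five seeds; not real measurements.
method_scores = [62.0, 65.0, 61.0, 66.0, 64.0]
baseline_scores = [55.0, 57.0, 56.0, 58.0, 54.0]

print(f"method:   {mean(method_scores):.1f} ± {stdev(method_scores):.1f}")
print(f"baseline: {mean(baseline_scores):.1f} ± {stdev(baseline_scores):.1f}")
print(f"paired t: {paired_t_statistic(method_scores, baseline_scores):.2f}")
```

Pairing by seed matters here: it removes seed-to-seed variation that both systems share, which an unpaired comparison would count as noise.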

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below and have revised the manuscript to provide the requested details and clarifications.

Point-by-point responses
  1. Referee: Abstract and experimental sections: the central claim of up to 21.7% gains and stable joint optimization rests on ablations showing component-wise improvement, yet no details are provided on experimental setup, baselines, statistical significance tests, task distributions, or variance across runs, creating major verification gaps for the reported performance.

    Authors: We agree that the initial submission lacked sufficient experimental details. In the revised version we have added a dedicated Experimental Setup section that fully specifies the environments and task distributions, lists all baselines with implementation references, describes the evaluation protocol, reports statistical significance via paired t-tests (including p-values), and presents results as mean ± standard deviation over five independent runs with different random seeds. These additions directly close the verification gaps for the reported performance numbers.
    revision: yes

  2. Referee: Executor Designer Hierarchical Rollout section: credit assignment across designer-executor trajectories is described via shared rollouts, but without explicit mention of variance reduction (e.g., baseline subtraction or importance sampling corrections), designer actions may receive biased signals correlated with downstream executor noise, undermining the claim that Stagewise Co-evolution delivers unbiased co-evolution.

    Authors: We thank the referee for highlighting this omission. The hierarchical rollout already applies a learned value baseline for advantage estimation on designer actions precisely to reduce variance and mitigate correlation with executor noise. This was not stated explicitly in the original text. We have now inserted the exact baseline formulation, the advantage estimator, and a short proof sketch showing that the resulting signals remain unbiased under the on-policy sampling used by Stagewise Co-evolution. Importance sampling corrections are unnecessary because all rollouts are on-policy.
    revision: yes
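
The general fact behind the second response is standard: subtracting an action-independent baseline from the reward leaves the expected policy gradient unchanged and can shrink its variance. The sketch below checks this exactly on a three-action softmax policy. It is a toy stand-in, not the paper's estimator: the probabilities and rewards are invented, and the exact expected reward is used in place of a learned value baseline.

```python
def grad_moments(probs, rewards, baseline):
    """Exact mean and second moment of the score-function gradient.

    For a softmax policy over logits, the gradient of log pi(a) is
    onehot(a) - probs, so the per-sample gradient is
    g(a) = (onehot(a) - probs) * (rewards[a] - baseline).
    """
    n = len(probs)
    mean_grad = [0.0] * n
    second_moment = 0.0
    for a, p_a in enumerate(probs):
        score = [(1.0 if j == a else 0.0) - probs[j] for j in range(n)]
        g = [s * (rewards[a] - baseline) for s in score]
        # Weight each action's gradient by its sampling probability.
        mean_grad = [m + p_a * gj for m, gj in zip(mean_grad, g)]
        second_moment += p_a * sum(gj * gj for gj in g)
    return mean_grad, second_moment


probs = [0.5, 0.3, 0.2]      # toy on-policy action probabilities
rewards = [1.0, 0.0, 0.5]    # toy per-action rewards
value_baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward

plain_mean, plain_m2 = grad_moments(probs, rewards, 0.0)
centred_mean, centred_m2 = grad_moments(probs, rewards, value_baseline)
print(plain_mean, centred_mean)  # same expected gradient
print(plain_m2, centred_m2)      # smaller second moment with the baseline
```

Because the two estimators share the same mean, a smaller second moment means a smaller variance. The argument relies on the baseline not depending on the sampled action, and on samples being drawn from the current policy.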

Circularity Check

0 steps flagged

No circularity in derivation chain

Full rationale

The paper introduces an empirical RL framework for joint optimization of MAS designer and executor via hierarchical rollouts and stagewise co-evolution. No load-bearing steps reduce by the paper's own equations or self-citations to fitted inputs by construction; reported performance gains (up to 21.7%) and ablations are presented as experimental outcomes from training trajectories, with credit assignment described as part of the RL setup rather than a self-referential definition. The framework remains self-contained against external benchmarks without renaming known results or smuggling ansatzes via prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard reinforcement learning assumptions plus the effectiveness of the two proposed training techniques; no explicit free parameters, axioms, or invented entities are detailed in the abstract.

pith-pipeline@v0.9.0 · 5541 in / 1046 out tokens · 35316 ms · 2026-05-15T02:45:37.586097+00:00 · methodology

