
arxiv: 2606.18837 · v2 · pith:K44Q3IYRnew · submitted 2026-06-17 · 💻 cs.MA · cs.AI · cs.LG

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Pith reviewed 2026-06-26 18:44 UTC · model grok-4.3

classification 💻 cs.MA cs.AI cs.LG
keywords multi-agent systems, meta-skill, large language models, automatic system generation, experience retention, trajectory rollout, contrastive analysis

The pith

Skill-MAS evolves a reusable Meta-Skill for multi-agent LLM systems by distilling strategy principles from task trajectories without parametric updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes Skill-MAS as a third path for automatic multi-agent system generation that retains experience separately from model training. It treats high-level orchestration as an evolvable Meta-Skill refined in a closed loop of sampling multiple trajectories per task and then applying selective reflection with hierarchical contrastive analysis on priority tasks. This setup is meant to combine the capability of frontier LLMs with accumulated generalizable strategies. A sympathetic reader would care because it addresses the repeated-search waste of inference-time methods and the capability ceiling of training-time methods.

Core claim

Skill-MAS conceptualizes the high-level orchestration capability as an evolvable Meta-Skill and refines architectural knowledge through a closed optimization loop of Multi-Trajectory Rollout, which samples a behavioral distribution for each task, and Selective Reflection, which adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles.

What carries the argument

The Meta-Skill as high-level orchestration capability, refined via the closed loop of Multi-Trajectory Rollout and Selective Reflection with hierarchical contrastive analysis.
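
The closed loop named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Meta-Skill is assumed to be a plain text document, and `run_mas`, `select`, and `reflect` are hypothetical callables standing in for MAS execution, priority-task selection, and the reflection-plus-rewrite step. The toy stand-ins at the bottom exist only so the sketch runs without any LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    task_id: str
    score: float  # end-to-end task score, assumed to lie in [0, 1]
    log: str      # whatever trace the reflection step would read


# Hypothetical interfaces; the paper's actual signatures are not given here.
Rollouts = Dict[str, List[Trajectory]]
RunMAS = Callable[[str, str], Trajectory]   # (meta_skill, task_id) -> Trajectory
SelectTasks = Callable[[Rollouts], List[str]]
Reflect = Callable[[str, Rollouts], str]    # returns the revised Meta-Skill text


def multi_trajectory_rollout(meta_skill: str, tasks: List[str],
                             run_mas: RunMAS, k: int) -> Rollouts:
    """Sample k trajectories per task under the current Meta-Skill."""
    return {task: [run_mas(meta_skill, task) for _ in range(k)] for task in tasks}


def evolve_meta_skill(meta_skill: str, tasks: List[str], run_mas: RunMAS,
                      select: SelectTasks, reflect: Reflect,
                      rounds: int, k: int) -> str:
    """Alternate rollout and selective reflection; no model weights change."""
    for _ in range(rounds):
        rollouts = multi_trajectory_rollout(meta_skill, tasks, run_mas, k)
        priority = select(rollouts)
        meta_skill = reflect(meta_skill, {t: rollouts[t] for t in priority})
    return meta_skill


# Toy stand-ins (assumptions for illustration only).
def toy_run(meta_skill: str, task_id: str) -> Trajectory:
    score = min(1.0, 0.2 * meta_skill.count("\n"))
    return Trajectory(task_id, score, log=f"{task_id}: {len(meta_skill)} chars of skill")


def toy_select(rollouts: Rollouts) -> List[str]:
    # Pick the single task with the lowest total score.
    return sorted(rollouts, key=lambda t: sum(x.score for x in rollouts[t]))[:1]


def toy_reflect(meta_skill: str, selected: Rollouts) -> str:
    return meta_skill + "\n- principle distilled from " + ", ".join(sorted(selected))


final_skill = evolve_meta_skill("# Meta-Skill", ["t1", "t2"], toy_run,
                                toy_select, toy_reflect, rounds=2, k=3)
```

The point of the structure is that the only state carried between rounds is the Meta-Skill text itself, which is what lets experience accumulate while the underlying LLM stays frozen.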

If this is right

  • Automatic MAS generation can achieve performance gains on complex benchmarks while using frontier LLMs without gradient updates.
  • The method maintains a favorable cost-performance trade-off by avoiding repeated identical searches and large-scale training.
  • Evolved Meta-Skills exhibit robustness and strong transferability across unseen tasks and different LLMs.
  • Experience retention is decoupled from parametric updates, allowing scaling to large models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could support continual adaptation of agent orchestration rules across entirely new domains without retraining base models.
  • If the distillation step generalizes, similar rollout-plus-reflection loops might apply to other LLM orchestration problems such as tool-use planning.
  • Transferability across LLMs suggests the Meta-Skill captures structural patterns that are somewhat model-agnostic.

Load-bearing premise

Hierarchical contrastive analysis on selectively chosen tasks can reliably distill generalizable strategy-level principles rather than task-specific patterns or noise.

What would settle it

Apply the evolved Meta-Skill to unseen benchmarks and to a different LLM. If it produces no performance gain or loses transferability there, the central claim fails.

Figures

Figures reproduced from arXiv: 2606.18837 by Chengwei Qin, Hehai Lin, Qi Yang.

Figure 1: Overview of MAS paradigms. (a)-(b) Comparison of existing Inference-time and Training-time MAS,
Figure 2: The evolutionary loop of Skill-MAS. The Meta-Skill
Figure 3: Left: Skill transferability heatmap across LLMs (DS: DeepSeek-V4-Flash, GPT: GPT-5.4-Nano) and tasks (BCP: BrowseComp-Plus, VITA: VitaBench). Right: Performance scaling across increasing multi-trajectory rollout numbers (K = 3, 5, 7).
Figure 4: Meta-Skill Evolution on BrowseComp-Plus.
Figure 5: Illustration of the initial Meta-Skill used for Skill-MAS-init and Skill-MAS evolution.
Figure 6: Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 1/3).
Figure 7: Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 2/3).
Figure 8: Illustration of the optimized Meta-Skill for DeepResearchBench (DeepSeek-V4-Flash, Part 3/3).
Figure 9: Illustration of the optimized Meta-Skill for HLE-MATH (DeepSeek-V4-Flash, Part 1/2).
Figure 10: Illustration of the optimized Meta-Skill for HLE-MATH (DeepSeek-V4-Flash, Part 2/2).
Figure 11: Illustration of the optimized Meta-Skill for BrowseComp-Plus (DeepSeek-V4-Flash, Part 1/2).
Figure 12: Illustration of the optimized Meta-Skill for BrowseComp-Plus (DeepSeek-V4-Flash, Part 2/2).
Figure 13: Illustration of the optimized Meta-Skill for VitaBench (DeepSeek-V4-Flash, Part 1/2).
Figure 14: Illustration of the optimized Meta-Skill for VitaBench (DeepSeek-V4-Flash, Part 2/2).
Figure 15: LLM-as-a-judge prompts used in DeepResearchBench.
Figure 16: LLM-as-a-judge prompts used in VitaBench.
Figure 17: MAS build contract used in the three-stage Skill-MAS construction pipeline (Part 1/2).
Figure 18: MAS build contract used in the three-stage Skill-MAS construction pipeline (Part 2/2).
Figure 19: Within-task reflection prompt in Skill-MAS evolution.
Figure 20: Cross-task reflection prompt in Skill-MAS evolution.
Figure 21: Skill optimization prompt for Skill-MAS evolution.
Original abstract

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Skill-MAS, a third path for LLM-based automatic multi-agent system generation that evolves a Meta-Skill to retain orchestration experience without parametric updates. It employs a closed loop consisting of (1) Multi-Trajectory Rollout to sample behavioral distributions under the current Meta-Skill and (2) Selective Reflection that adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable strategy-level principles. Experiments across four complex benchmarks and four LLMs are claimed to demonstrate remarkable performance gains, a favorable cost-performance trade-off, robustness, and strong transferability to unseen tasks and different LLMs.

Significance. If the empirical claims hold, the work offers a conceptually appealing bridge between inference-time MAS (which cannot retain experience) and training-time MAS (which are limited by model scale). The decoupling of experience retention from gradient updates via an evolvable Meta-Skill could enable scalable, high-capability automatic MAS. The closed optimization loop and emphasis on hierarchical contrastive distillation represent a novel framing, though the significance hinges on whether the distilled principles are demonstrably general rather than benchmark-specific.

major comments (2)
  1. [§3.2] §3.2 (Selective Reflection): the description of hierarchical contrastive analysis does not specify how contrastive pairs are constructed, how hierarchy levels are defined, or the precise selection criteria for priority tasks. Without these details it is impossible to evaluate whether the procedure reliably extracts transferable orchestration strategies or instead amplifies task idiosyncrasies from the four benchmarks; this mechanism is load-bearing for the robustness and cross-task/cross-LLM transferability claims.
  2. [Experiments] Experiments section (transferability results): the reported strong transferability to unseen tasks is presented without explicit controls that isolate the contribution of the evolved Meta-Skill from possible memorization of benchmark patterns. A direct comparison against a baseline that applies task-specific heuristics distilled from the same rollouts would be required to substantiate that the output constitutes generalizable strategy-level principles rather than benchmark-tuned heuristics.
minor comments (2)
  1. [Abstract] The abstract states performance gains and cost trade-offs but does not name the four benchmarks or the four LLMs; adding these identifiers would improve reproducibility.
  2. [§3] Notation for the Meta-Skill and the contrastive loss (if any) should be introduced consistently in §3 and reused in the experimental tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas for improved clarity and rigor. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the presentation of the Selective Reflection mechanism and the transferability analysis.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Selective Reflection): the description of hierarchical contrastive analysis does not specify how contrastive pairs are constructed, how hierarchy levels are defined, or the precise selection criteria for priority tasks. Without these details it is impossible to evaluate whether the procedure reliably extracts transferable orchestration strategies or instead amplifies task idiosyncrasies from the four benchmarks; this mechanism is load-bearing for the robustness and cross-task/cross-LLM transferability claims.

    Authors: We agree that the current description in §3.2 is high-level and would benefit from explicit specifications to allow readers to assess the mechanism's ability to produce generalizable principles. In the revised manuscript we will expand this section to detail: contrastive pairs are formed from trajectories sampled under the same Meta-Skill that differ substantially in end-to-end task success; hierarchy levels are organized as task-specific orchestration patterns, agent-role coordination rules, and system-wide workflow abstractions; and priority tasks are chosen by ranking tasks according to performance variance across the multi-trajectory rollout combined with a diversity score that favors tasks exposing systemic rather than idiosyncratic failures. These additions will directly address concerns about benchmark idiosyncrasies versus transferable strategy-level principles. revision: yes

  2. Referee: [Experiments] Experiments section (transferability results): the reported strong transferability to unseen tasks is presented without explicit controls that isolate the contribution of the evolved Meta-Skill from possible memorization of benchmark patterns. A direct comparison against a baseline that applies task-specific heuristics distilled from the same rollouts would be required to substantiate that the output constitutes generalizable strategy-level principles rather than benchmark-tuned heuristics.

    Authors: We acknowledge that the transferability results, while showing gains on unseen tasks and across LLMs, would be more convincing with an explicit control isolating the Meta-Skill from potential benchmark-specific memorization. In the revision we will add a new baseline experiment that distills task-specific heuristics directly from the identical multi-trajectory rollouts (without the selective reflection and hierarchical contrastive steps) and compares its transfer performance against the full Skill-MAS Meta-Skill. This comparison will provide evidence that the evolved Meta-Skill captures generalizable orchestration strategies beyond task-tuned heuristics. revision: yes
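
The selection and pairing rules promised in the first response can be made concrete with a small sketch. It is illustrative only and rests on stated assumptions: scores are taken to lie in [0, 1], the variance and diversity terms are combined additively with a hypothetical `weight`, the diversity scores are supplied as inputs because the rebuttal does not define them, and "differ substantially" is read as a hypothetical `min_gap` threshold on score difference.

```python
from itertools import combinations
from statistics import pvariance
from typing import Dict, List, Tuple


def rank_priority_tasks(scores: Dict[str, List[float]],
                        diversity: Dict[str, float],
                        weight: float = 0.5) -> List[str]:
    """Rank tasks by rollout score variance plus a weighted diversity term.

    The additive form and the default weight are assumptions, not the paper's rule.
    """
    def priority(task: str) -> float:
        return pvariance(scores[task]) + weight * diversity.get(task, 0.0)

    return sorted(scores, key=priority, reverse=True)


def contrastive_pairs(task_scores: List[float],
                      min_gap: float = 0.5) -> List[Tuple[int, int]]:
    """Return (high_index, low_index) pairs of rollouts from one task whose
    scores differ by at least min_gap (an assumed threshold)."""
    pairs = []
    for i, j in combinations(range(len(task_scores)), 2):
        if abs(task_scores[i] - task_scores[j]) >= min_gap:
            high, low = (i, j) if task_scores[i] > task_scores[j] else (j, i)
            pairs.append((high, low))
    return pairs


# Made-up scores for three tasks, K = 3 rollouts each.
scores = {"a": [1.0, 0.0, 1.0], "b": [0.5, 0.5, 0.5], "c": [0.0, 0.2, 0.1]}
diversity = {"a": 0.1, "b": 0.0, "c": 0.4}

ranking = rank_priority_tasks(scores, diversity)
pairs_a = contrastive_pairs(scores["a"])
```

With these made-up numbers, task `a` ranks first because its rollouts disagree most, and task `b` yields no contrastive pairs at all because every rollout scored the same. That second case is worth noting: a task with uniformly mediocre rollouts contributes nothing to within-task contrast under this reading, so any signal from it would have to come from the cross-task level.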

Circularity Check

0 steps flagged

No circularity in derivation chain; method is empirical and self-contained

Full rationale

The paper describes Skill-MAS as an iterative loop of Multi-Trajectory Rollout followed by Selective Reflection via hierarchical contrastive analysis to evolve a Meta-Skill. No equations, fitted parameters, predictions, or first-principles derivations are presented that could reduce to inputs by construction. Claims of robustness and transferability rest on external benchmark experiments across four benchmarks and four LLMs, not on any self-referential fitting or self-citation chain. No self-definitional steps, ansatz smuggling, or renaming of known results appear. The derivation is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level concept of Meta-Skill itself.

invented entities (1)
  • Meta-Skill (no independent evidence)
    purpose: High-level orchestration capability treated as evolvable without model updates
    Introduced in the abstract as the central new object that is refined through rollout and reflection.

pith-pipeline@v0.9.1-grok · 5762 in / 1195 out tokens · 14529 ms · 2026-06-26T18:44:35.304878+00:00 · methodology

