arXiv: 2511.06101 · v3 · submitted 2025-11-08 · cs.LG · cs.AI · cs.CL

SynthAgent: Adapting Web Agents with Synthetic Supervision

Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao

Pith reviewed 2026-05-17 23:21 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords web agents · synthetic supervision · task refinement · trajectory refinement · agent adaptation · web navigation · synthetic data generation

The pith

SynthAgent adapts web agents to new sites by refining synthetic tasks and trajectories to cut hallucinations and noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents struggle to handle unfamiliar websites because real demonstrations and tasks for those sites are scarce. SynthAgent addresses this by generating all training data synthetically and then applying two rounds of refinement: one to fix tasks that conflict with what the site actually shows, and another to clean up the recorded action sequences using broader context. The refined data is used to fine-tune open-source agents so they perform better on the target environment. The authors show through experiments that this dual-refinement approach beats prior synthetic-data techniques, suggesting that data quality matters more than sheer volume when adapting agents.

Core claim

By first synthesizing diverse tasks through categorized exploration of web elements, then refining tasks only on detected conflicts during collection and refining trajectories afterward with global context, SynthAgent produces supervision data that allows open-source web agents to adapt successfully to previously unseen websites.

What carries the argument

Dual refinement: conflict-triggered task refinement during trajectory collection plus global-context trajectory refinement after collection, which together reduce hallucinations and misalignments while keeping task variety.
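The control flow of the dual refinement can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: every name here (`Step`, `Trajectory`, `collect_with_task_refinement`, `refine_trajectory`, and the toy `act`, `detect_conflict`, `refine_task`, `is_noisy` callbacks) is hypothetical, and the rule-based stubs stand in for what would be model calls in the real pipeline. The trajectory pass below only deletes steps, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str
    observation: str


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)


def collect_with_task_refinement(task, act, detect_conflict, refine_task, max_steps=10):
    """Roll out one task; rewrite the task only when a conflict is detected."""
    traj = Trajectory(task=task)
    for _ in range(max_steps):
        step = act(traj.task, traj.steps)
        traj.steps.append(step)
        if step.action == "NONE":  # completion signal used in the quoted prompts
            break
        if detect_conflict(traj.task, step.observation):
            traj.task = refine_task(traj.task, traj.steps)
    return traj


def refine_trajectory(traj, is_noisy):
    """Post-collection pass that sees the whole trajectory and drops noisy steps."""
    kept = [s for i, s in enumerate(traj.steps) if not is_noisy(traj, i)]
    return Trajectory(task=traj.task, steps=kept)


# Toy stand-ins (all hypothetical): a scripted agent and rule-based checks.
script = [
    Step("CLICK search", "no 'Gadgets' category; 'Electronics' exists"),
    Step("CLICK search", "no 'Gadgets' category; 'Electronics' exists"),  # redundant repeat
    Step("NONE", "answer found"),
]


def act(task, history):
    return script[len(history)]


def detect_conflict(task, observation):
    return "Gadgets" in task and "no 'Gadgets'" in observation


def refine_task(task, history):
    return task.replace("Gadgets", "Electronics")


def is_noisy(traj, i):
    # Treat an immediate repeat of the previous action as redundant.
    return i > 0 and traj.steps[i].action == traj.steps[i - 1].action


raw = collect_with_task_refinement(
    "Find the priciest item in Gadgets", act, detect_conflict, refine_task
)
clean = refine_trajectory(raw, is_noisy)
```

In this toy run the hallucinated category is replaced once, at the first conflicting observation, and the repeated click is removed afterward; the task is left untouched on steps where no conflict is detected.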

If this is right

Fine-tuned agents achieve higher success rates on target websites than agents trained on existing synthetic methods.
Task refinement during collection preserves consistency while removing hallucinations that would otherwise make tasks impossible to execute.
Trajectory refinement with global context reduces redundant or misaligned actions that degrade learning.
The entire pipeline runs without any human demonstrations or environment-specific real data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same refinement logic could be tested on non-web agent domains such as mobile interfaces or API agents where synthetic trajectories are also noisy.
If the method scales, organizations could adapt agents to internal tools or client sites with far less manual data labeling.
Future work might measure how much each refinement step contributes separately by ablating one at a time on the same base tasks.

Load-bearing premise

The refinements reliably remove unexecutable tasks and noisy actions without discarding the diversity needed for genuine generalization to new websites.

What would settle it

Train one agent on SynthAgent data and an otherwise identical agent on the same synthetic data without refinement, then test both on a fresh website. If the two success rates are statistically indistinguishable, the claim that the dual refinement steps drive the gains loses its support.
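One simple way to run that comparison is a two-proportion z-test on the success counts. The sketch below uses only the standard library; the function name and the counts are made up for illustration and are not results from the paper. If both agents are scored on the same task set, a paired test such as McNemar's would arguably be the better choice, and a non-significant result only fails to show a difference rather than proving there is none.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two success rates."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("each agent needs at least one evaluated task")
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # both agents succeeded on all tasks or on none
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: refined-data agent vs. unrefined-data agent, 100 tasks each.
z_same, p_same = two_proportion_z_test(30, 100, 30, 100)  # identical rates
z_diff, p_diff = two_proportion_z_test(40, 100, 20, 100)  # 40% vs. 20%
```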

Figures

Figures reproduced from arXiv: 2511.06101 by Anson Bastos, Baolin Peng, Chetan Bansal, Huaxiu Yao, Jianfeng Gao, Qianhui Wu, Rujia Wang, Saravan Rajmohan, Siwei Han, Xuchao Zhang, Yiming Liang, Zhaoyang Wang.

**Figure 2.** Overview of SynthAgent compared to baseline methods OS-Genesis (Sun et al., 2025) and Explorer (Pahuja et al., 2025). SynthAgent consists of four steps: (1) Task Synthesis, generating diverse tasks via categorized exploration; (2) Task Refinement, refining tasks during trajectory collection to avoid task hallucinations; (3) Trajectory Refinement, globally refining collected trajectories to remove noisy ac…

**Figure 3.** t-SNE visualization of synthesized tasks. Tasks from the test set are written by humans.

**Figure 4.** Performance of SynthAgent across different websites with varying synthesis data amounts.

**Figure 5.** Case study of Task Refinement. During task execution, …

**Figure 6.** Case study of Trajectory Refinement. After trajectory collection, …

Original abstract

Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, tasks are refined only when conflicts with observations are detected, which mitigates hallucinations while preserving task consistency. After collection, we conduct trajectory refinement with global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code is publicly available at https://github.com/aiming-lab/SynthAgent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SynthAgent adds conflict-triggered task fixes and global-context trajectory cleaning to synthetic web-agent data, but the reported gains are end-to-end only and do not directly show those steps improve data quality.

Colleague, the main point is that SynthAgent generates tasks via categorized web-element exploration, then refines tasks only on detected conflicts and cleans trajectories afterward with global context before fine-tuning open-source agents. The authors report better adaptation results than prior synthetic-data baselines on new websites. That dual-refinement pipeline is the concrete addition over earlier synthetic-generation work. The public code is a plus for anyone who wants to test or extend the steps. The approach targets a real bottleneck, the scarcity of environment-specific demonstrations, and the pipeline is described clearly enough that a practitioner could implement the main ideas.

The soft spot is the evaluation. All gains are measured by final agent success rates after fine-tuning. There are no pre/post numbers on how many hallucinations or misaligned actions the refinements actually removed, no ablation that turns the conflict check or global-context pass on and off, and no independent quality metrics on the synthetic data itself. Without those, it is hard to tell whether the refinements are doing the work or whether the results trace to data volume, model choice, or evaluation overlap. The abstract states outperformance but leaves the supporting details for the full paper.

This is useful reading for groups working on web agents or synthetic supervision pipelines. A reader who needs practical ways to adapt agents without new human data will pick up the refinement tactics even if the causal evidence stays indirect. The work is coherent on its own terms and shows honest engagement with the data-quality problem, so it deserves a serious referee to check the experimental controls and whether the claimed quality improvements hold up under closer inspection.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SynthAgent, a fully synthetic supervision framework for adapting web agents to new websites. It synthesizes diverse tasks via categorized exploration of web elements, refines tasks only upon detected conflicts to mitigate hallucinations while preserving consistency, performs global-context trajectory refinement to reduce noise and misalignments, and fine-tunes open-source agents on the resulting data. The central claim is that this dual-refinement pipeline yields higher-quality supervision than prior synthetic methods, as evidenced by superior agent performance on target environments.

Significance. If the refinements can be shown to measurably improve task executability and trajectory alignment beyond what data volume or base-model choice alone would achieve, the work would offer a practical, scalable route to environment-specific adaptation for web agents without human annotation. Public code release aids reproducibility and allows direct verification of the pipeline.

major comments (2)

[Abstract and Experiments section] The abstract and introduction assert outperformance over existing synthetic data methods, yet no concrete metrics, baseline names, dataset sizes, or statistical significance tests are reported. The Experiments section must supply these details together with controls that isolate the contribution of dual refinement from confounding factors such as data volume or evaluation-website overlap.
[Sections 3.2–3.3 (Task and Trajectory Refinement) and Experiments] The central claim that task refinement (only on conflicts) plus global-context trajectory refinement produces higher-quality supervision rests on the assumption that these steps reduce unexecutable tasks and misaligned actions. No independent quantification—such as pre/post oracle success rates on the synthetic tasks themselves or counts of hallucinated elements removed—is provided; only end-to-end agent success rates after fine-tuning are shown. This leaves the load-bearing quality-improvement argument unsupported.

minor comments (1)

[Section 3.1] Notation for the categorized exploration process and the conflict-detection condition could be made more precise to facilitate re-implementation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the changes we will make to improve clarity and evidence in the revised manuscript.

Referee: [Abstract and Experiments section] The abstract and introduction assert outperformance over existing synthetic data methods, yet no concrete metrics, baseline names, dataset sizes, or statistical significance tests are reported. The Experiments section must supply these details together with controls that isolate the contribution of dual refinement from confounding factors such as data volume or evaluation-website overlap.

Authors: We agree that the Experiments section would benefit from more explicit reporting. In the revision we will add a table listing all baseline methods with their exact dataset sizes, report mean success rates with standard deviations, include statistical significance results (e.g., paired t-tests or Wilcoxon tests), and present two new controls: (1) an ablation varying synthetic data volume while keeping the dual-refinement pipeline fixed, and (2) explicit confirmation that none of the evaluation websites appear in the training data. revision: yes
Referee: [Sections 3.2–3.3 (Task and Trajectory Refinement) and Experiments] The central claim that task refinement (only on conflicts) plus global-context trajectory refinement produces higher-quality supervision rests on the assumption that these steps reduce unexecutable tasks and misaligned actions. No independent quantification—such as pre/post oracle success rates on the synthetic tasks themselves or counts of hallucinated elements removed—is provided; only end-to-end agent success rates after fine-tuning are shown. This leaves the load-bearing quality-improvement argument unsupported.

Authors: The referee is correct that direct, independent measures of refinement quality are currently absent. We will add a dedicated analysis subsection that reports: the fraction of tasks triggering conflict-based refinement, the number of hallucinated elements removed (via oracle inspection on a held-out sample), and pre/post-refinement executability rates measured by an oracle policy. These results will be presented before the end-to-end fine-tuning experiments to directly support the quality claim. revision: yes
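The quantities promised in this response are straightforward to tabulate once each synthetic task has been annotated. The sketch below assumes a hypothetical per-task record (`TaskRecord`) with three boolean fields filled in by an oracle or human check; neither the record layout nor the toy data comes from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskRecord:
    refined: bool            # did conflict detection trigger a task rewrite?
    executable_before: bool  # oracle could complete the original task
    executable_after: bool   # oracle could complete the final task


def refinement_report(records: Iterable[TaskRecord]) -> Dict[str, float]:
    """Fraction of tasks refined, plus executability rates before and after."""
    records = list(records)
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    return {
        "refined_fraction": sum(r.refined for r in records) / n,
        "executable_before": sum(r.executable_before for r in records) / n,
        "executable_after": sum(r.executable_after for r in records) / n,
    }


# Toy annotations (made up): two tasks were refined, one of them became executable.
sample = [
    TaskRecord(refined=True, executable_before=False, executable_after=True),
    TaskRecord(refined=False, executable_before=True, executable_after=True),
    TaskRecord(refined=True, executable_before=False, executable_after=False),
    TaskRecord(refined=False, executable_before=True, executable_after=True),
]
report = refinement_report(sample)
```

Reporting the before and after rates on the same task sample keeps the comparison independent of the downstream fine-tuning results, which is the gap the referee points to.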

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external validation

The manuscript describes an empirical pipeline for synthetic task generation, conflict-triggered task refinement, global-context trajectory refinement, and subsequent fine-tuning of web agents. No equations, derivations, fitted parameters, or first-principles results are presented that reduce to the inputs by construction. Claims rest on end-to-end experimental comparisons against prior synthetic methods on held-out websites, which constitute independent external benchmarks rather than self-referential reductions. Self-citations, if present, are not load-bearing for any central premise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Method relies on standard assumptions from web agent and synthetic data literature; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5533 in / 970 out tokens · 28289 ms · 2026-05-17T23:21:10.685602+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: SynthAgent, a fully synthetic supervision framework that aims at improving synthetic data quality via dual refinement of both tasks and trajectories... categorized exploration... task refinement triggered by explicit conflict detection... trajectory refinement with global context

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 4 internal anchors

[1]

Mind2Web: Towards a Generalist Agent for the Web

Mind2web: Towards a generalist agent for the web. Preprint, arXiv:2306.06070. Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Pal: Program-aided language models. Preprint, arXiv:2211.10435. Yifei Gao, Junhong Ye, Jiaqi Wang, and Jitao Sang

[2]

Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu

Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis. Preprint, arXiv:2507.04370. Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling synthetic data creation with 1,000,000,000 personas. arXiv preprint arXiv:2406.20094. Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zh...

[3]

WebSailor: Navigating Super-human Reasoning for Web Agent

Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592. Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe

[4]

In The Twelfth International Conference on Learning Representations

Let’s verify step by step. In The Twelfth International Conference on Learning Representations. Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, and 1 others

[5]

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990. Xing Han Lu, Zdeněk Kasner, and Siva Reddy. 2024. WebLINX: Real-world website navigation with multi-turn dialogue. In Forty-first International Conference on Machine Learning. Zhengxi Lu, Yuxi...

[6]

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. Preprint, arXiv:2412.19723. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca. Qwen Team...

[7]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint a...

[8]

Fully explore the current page and its content to understand its functionality and layout

[9]

Categorize ALL provided {element_num} elements into different categories (list[dict]) based on their natural purpose

[10]

Add a {const_uninteractive_category} category (list[int]) for non-interactive elements that cannot be CLICK, TYPE, or HOVER

[11]

action": choose from [CLICK, TYPE, HOVER],

For each category (except {const_uninteractive_category}), decide: { "action": choose from [CLICK, TYPE, HOVER], "element_id": id_of_element (int), "value": if TYPE, provide text to type; else ”, "low-level_instruction": concise description of the action } Example low-level instructions: - "Click on the ’Add to Cart’ button next to the product to add it t...

[12]

Analysis

Provide an appropriate and meaningful value for "value" if the action is TYPE. Examples: - For a search box, generate a realistic search query. - For a textbox, generate plausible text according to context. **Output Requirements** Return ONLY a JSON dictionary (no commentary) with the following format: { "Analysis": "your analysis of the current page stat...

[13]

Click on the ’Add to Cart’ button next to the product to add it to your shopping cart

Sub-Instruction: Create a natural language instruction for the current action based on the interface changes it caused. The instruction should be concise, clear, and actionable, incorporating specific details critical to the task, such as elements, file names, timestamps, or other relevant content visible in the screenshots. For example: - "Click on the ’...

[14]

Then, examine key elements in both screenshots and consider possible operations based on these elements

Analysis: Carefully analyze the before-and-after screenshots step by step, focusing on the changes caused by the action. Then, examine key elements in both screenshots and consider possible operations based on these elements. For example: "The previous screen displayed the main interface of a shopping website, featuring multiple product categories and sev...

[15]

summarize the information about a product

High-Level Instruction: Based on the before-and-after screenshots, the action, and the analysis, generate a high-level task that you believe can be completed within the current interface. There are three types of tasks: - Information seeking: The user wants to obtain certain information from the webpage, such as product details, reviews, map information, ...

[16]

{high_level_task}

High-Level Task (your ultimate goal to finish): "{high_level_task}"

[17]

Current Page (only current view, not full page, you may need to scroll to see more): - URL: {url} - Accessibility Tree (Page Context): {page_context} - Elements (addressable in this view): {elements} - Screenshot (only current view, not full page): {img_info}

[18]

History of Actions ({hint_for_history}): {previous_state_action} **Critical Rules for Success**

[19]

Issue only actions valid for the current observation (elements, accessibility tree, screenshot)

[20]

Propose ONE atomic action per item in your Potential-Actions list; actions must be independently executable

[21]

Prefer element IDs from the current Elements list for CLICK/TYPE/HOVER

[22]

Provide meaningful non-empty value if action∈{TYPE, SCROLL, GOTO, NONE, STOP}

[23]

If the task is complete, use NONE with the final answer in value; do not propose further actions

[24]

Be concise, avoid redundant/risky actions; each action must advance the task

[25]

If the task is hallucinated/low-quality/impossible, cautiously choose STOP based on observations/history. Pseudo-code for deciding STOP: if high_level_task lacks required info→STOP if high_level_task contains hallucinations→STOP if task is inappropriate/harmful→STOP if multiple (≥3) similar attempts already failed→STOP else→consider NON-STOP actions

[26]

state_observation_summary

First write a "state_observation_summary", then do step-by-step "reasoning", then decide "next_action"

[27]

Expect MULTIPLE steps; choose the next action that changes state; continue iteratively

[28]

You MUST actively decide the next step; do not choose NONE/STOP unless certain of finish/impossibility

[29]

reasoning

In "reasoning", explicitly apply the STOP vs NON-STOP pseudo-code

[30]

Actively explore alternatives before STOP if current approach stalls

[31]

Elements (addressable in this view)

Choose elements strictly from "Elements (addressable in this view)"; justify this choice in "reasoning"

[32]

If the page doesn’t change after an action, consider SCROLL to reveal more elements

[33]

MM/DD/YYYY

Special note: when typing a date, use "MM/DD/YYYY". **Output Requirements** Return ONLY a JSON dictionary (no commentary) with: { "state_observation_summary": "1–3 sentence summary of the current state relevant to the task", "reasoning": "step-by-step reasoning to decide the next action; include rule-based justification and STOP check", "next_action": { "...

[34]

What is the most expensive product in the ’Electronics’ category?

**Information Seeking** — User aims to retrieve specific information from the website. - Examples: - "What is the most expensive product in the ’Electronics’ category?" - "What are the top 5 posts in the ’Technology’ forum?" - "Summarize the reviews for the product ’iPhone 11’."

[35]

Go to the billing page to check the latest transactions

**Site Navigation** — User aims to reach a specific page or site state. - Examples: - "Go to the billing page to check the latest transactions." - "Navigate to the ’Contact Us’ page and fill out the form to express interest in joining the company." - "Find the wiki page of ’the youngest person to receive a Nobel Prize’."

[36]

Create a user account with username ’bob2134’ and password ’128nxc18zxv’

**Content Modification** — User aims to change site content or settings. - Examples: - "Create a user account with username ’bob2134’ and password ’128nxc18zxv’." - "Post a new article titled ’The Future of AI’ in the ’Technology’ forum." - "Create a code repo named ’Agent’ and add a README with the text ’This is a code repo for an intelligent agent.’" ##...

[37]

**Invalid or Inconsistent Goal** — target entity/page/action does not exist, cannot be located, or conflicts with observed facts

[38]

**Insufficient Executable Details** — essential parameters are missing and cannot be inferred

[39]

### When NOT to REFINE- Goal is valid and consistent with observations

**Stalled or Repetitive Execution** — three or more consecutive actions show no meaningful change, or same error repeats. ### When NOT to REFINE- Goal is valid and consistent with observations. - Essential parameters are available or can be inferred. - Actions show measurable progress. - No persistent or repetitive failures detected. ### How to REFINE If ...

[40]

**Concretize Missing Details** — add essential parameters from history or observation

[41]

**Align with Reality** — replace hallucinated entities with actual ones found on the site

[42]

**Downscope the Goal** — adjust to the next achievable milestone

[43]

Analysis

**Preserve Task Type** — keep within same category unless required otherwise. ## Goal Ensure the refined task is either already completed or highly likely to complete within the next 1–2 steps. ## Output Requirements - Format: JSON dictionary only, no commentary. - Fields:- "Analysis": Step-by-step reasoning. - "Need-to-Refine": "yes" or "no". - "High-Lev...

[44]

Previous High-Level-Tasks (oldest to newest): <start_previous_high_level_tasks> {previous_high_level_tasks} <end_previous_high_level_tasks>

[45]

History of Actions ({hint_for_history}): <start_action> {previous_state_action} <end_action>

[46]

{curr_url}

Current Page (only current view): - URL: "{curr_url}" - Page Context: <start_context> {curr_state_context} <end_context> - Screenshot: "{img_info}" — You ONLY need to return a JSON dictionary formatted as follows (no commentary): { "Analysis": "step-by-step reasoning", "Need-to-Refine": "yes or no", "High-Level-Task": "refined task if yes, otherwise empty...

[47]

Goal Alignment (0–25): Steps relevant to the high-level task

[48]

Logical Order (0–25): Steps follow a coherent and sensible sequence

[49]

Efficiency (0–25): Avoids redundant or unnecessary actions

[50]

task": "<exact high-level task string>

Success Likelihood (0–25): Likely to end successfully with NONE (non-empty value). Note: The score is advisory; the final decision (keep/refine/drop) depends on qualitative judgment. — ### Decision Policy - Always ensure kept/refined trajectories end with a NONE action and non-empty value. - If refining, reorder or delete existing steps (do not add new on...

[51]

navigation vs

**Intent Variety (0–25):** Do the tasks represent different user intents (e.g., information seeking vs. navigation vs. modification)?

[52]

**Action Diversity (0–25):** Do the tasks require different types of GUI interactions (e.g., clicking, typing, scrolling, submitting forms)?

[53]

score": <int>, // 0–100 total diversity score

**Goal Coverage (0–25):** Do the tasks explore different meaningful aspects or functionalities of the environment?4. **Redundancy Minimization (0–25):** Are there minimal duplicate or near-duplicate tasks (i.e., no rephrasing of the same goal)? — ## Output Requirement (STRICT) Return ONLY one JSON object (no extra text, no code fences): { "score": <int>, ...
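The STOP pseudo-code quoted in entry [25] maps directly onto a small predicate. In the quoted prompt the decision is left to the model's own reasoning, so the sketch below is only a deterministic reading of that rule; `TaskState`, its field names, and `should_stop` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskState:
    lacks_required_info: bool = False
    contains_hallucination: bool = False
    inappropriate_or_harmful: bool = False
    similar_failed_attempts: int = 0


def should_stop(state: TaskState, max_failed_attempts: int = 3) -> bool:
    """Return True when any STOP condition from the quoted rule holds."""
    if state.lacks_required_info:
        return True
    if state.contains_hallucination:
        return True
    if state.inappropriate_or_harmful:
        return True
    # "multiple (>=3) similar attempts already failed" in the quoted rule
    if state.similar_failed_attempts >= max_failed_attempts:
        return True
    return False  # otherwise: consider NON-STOP actions


stop_on_clean_state = should_stop(TaskState())
stop_after_two_failures = should_stop(TaskState(similar_failed_attempts=2))
stop_after_three_failures = should_stop(TaskState(similar_failed_attempts=3))
stop_on_hallucination = should_stop(TaskState(contains_hallucination=True))
```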
