Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Fatemeh Pesaran zadeh; Gunhee Kim; Seyeon Choi; Siva Reddy; Xing Han Lù

arxiv: 2605.20291 · v2 · pith:AKPME5T6new · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-21 08:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords: web agents, out-of-domain generalization, trajectory selection, data efficiency, importance and diversity, greedy algorithm, AXTree pruning, LLM agents

The pith

Selecting important and diverse trajectories lets web agents generalize out of domain while cutting training costs by an order of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to demonstrate that web agents trained offline on full trajectory datasets can be made to generalize better to new websites by instead using a carefully chosen smaller subset of training data. The core idea is to pick trajectories that are both important on their own and diverse from each other in terms of the states, websites, and interaction patterns they involve, using a greedy algorithm to solve this selection problem under a fixed budget. Additional steps like pruning accessibility trees to focus only on the target of each action and generating reasoning in the model's own style further boost efficiency and reduce mismatch. A sympathetic reader would care because current approaches waste compute on redundant or noisy data and still fail when the agent encounters unfamiliar sites or tasks.

Core claim

The central discovery is that a greedy optimization of an objective combining unary importance scores with pairwise diversity measures across states, websites, and interaction patterns can identify a compact set of trajectories that, when used for fine-tuning, yields superior out-of-domain performance on web agent benchmarks compared to using the entire dataset, while delivering training speedups of approximately 9.7 to 12.5 times.

What carries the argument

The importance-diversity objective solved greedily to select trajectory steps, combined with target-centered AXTree pruning and model-generated rationales.
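The selection step named above can be sketched as a plain marginal-gain greedy loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_select`, the trade-off weight `lam`, and the additive form of the objective (sum of importance scores plus a weighted sum of pairwise distances) are all hypothetical choices made for the example.

```python
from typing import Callable, List, Sequence


def greedy_select(
    importance: Sequence[float],
    distance: Callable[[int, int], float],
    budget: int,
    lam: float = 1.0,
) -> List[int]:
    """Pick up to `budget` indices, greedily maximizing an assumed objective:
    sum of importance over the subset + lam * sum of pairwise distances.
    """
    n = len(importance)
    selected: List[int] = []
    remaining = set(range(n))
    # div_gain[i] caches the summed distance from i to every selected item,
    # so each round costs O(n) instead of O(n * |selected|).
    div_gain = [0.0] * n

    while remaining and len(selected) < budget:
        # Marginal gain of adding i: its own importance plus the diversity
        # it contributes against the current subset. Ties go to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda i: importance[i] + lam * div_gain[i],
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            div_gain[i] += distance(i, best)

    return selected
```

With `lam = 0` the loop reduces to taking the top-`budget` items by importance; raising `lam` pushes it toward items far from those already chosen, which is the trade-off the objective is meant to express.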

If this is right

Out-of-domain success rates increase on WebArena, WorkArena, and MiniWob when training with the selected data.
Training time is reduced by factors of 9.7 to 12.5 across Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models.
The method applies to both AgentTrek and NNetNav training datasets.
Style-consistent rationales help reasoning-native models adapt better.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar selection criteria could improve efficiency in training agents for other environments like mobile apps or games.
The focus on diversity over interaction patterns may help address long-tail behaviors in agent tasks.
Reducing data volume this way might lower the barrier to iterating on web agent designs.

Load-bearing premise

A greedy solution to balancing importance and diversity will reliably choose trajectories from which the model learns generalizable behaviors for unseen websites and tasks.

What would settle it

The claim would be undermined if experiments on held-out websites showed that models fine-tuned on Weasel-selected trajectories achieve lower task success rates than models trained on the full dataset or on randomly sampled trajectories of equal size.

Figures

Figures reproduced from arXiv: 2605.20291 by Fatemeh Pesaran zadeh, Gunhee Kim, Seyeon Choi, Siva Reddy, Xing Han Lù.

**Figure 1.** Overview of WEASEL. Conventional trained web agents show a sharp performance drop under out-of-domain shifts to unseen websites and interaction patterns. WEASEL tackles this challenge via novel trajectory selection: it scores offline demonstration steps for goal relevance and diversity, then applies greedy subset selection under a fixed budget. Agents trained with WEASEL generalize better to unseen test…

**Figure 2.** (Left): An example of a curated trajectory after applying WEASEL. Although the original collected data contain noisy steps (t = 4), and erroneous actions (t = 0), WEASEL selects a compact subset that retains only the most informative steps (in red) for the goal. (Right): Overview of WEASEL. We first perform element-wise score calculation using unary importance and pairwise diversity. WEASEL then applies a …

**Figure 3.** Token distribution of 10K subsamples of AgentTrek (Xu et al., 2024) before pruning (green) and after target-centered pruning (blue). Pruning substantially reduces long-tail states, making the resulting sequences more manageable for training.

… quality term plus a sum of pairwise distances under a cardinality constraint (Borodin et al., 2017). For metric distances, a greedy algorithm achieves a constant-fa…

**Figure 4.** An illustration of Target-centered Pruning. Given a state $s_t$ in the form of AXTree and gold action $a_t$, we retain only the AXTree elements within a fixed window of size $w$ centered at the target index $k_t^*$, producing the pruned state $\tilde{s}_t$. The $k$-th node in the linearized AXTree at step $t$ is denoted $v_{t,k}$ (e.g., $v_{t,1}$, $v_{t,2}$), and $v_{t,k_t^*}$ is the gold target node.

2.4. Target-centered Pruning

Web states can be p…

**Figure 5.** Success rate decreases as the pruning offset increases. Results are reported on WebArena-Lite.
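The target-centered pruning described in the Figure 4 caption can be sketched as a slice over the linearized AXTree. This is an illustrative sketch only: the function name `prune_axtree` and the representation of nodes as strings are hypothetical, and it assumes `w` counts the nodes kept on each side of the target, which the caption does not pin down.

```python
from typing import List


def prune_axtree(nodes: List[str], target_idx: int, w: int) -> List[str]:
    """Keep only the nodes within `w` positions of the gold target node.

    `nodes` is the linearized AXTree for one step and `target_idx` is the
    position of the gold target node. Assumption: `w` is a half-width, so
    at most 2 * w + 1 nodes survive.
    """
    if not 0 <= target_idx < len(nodes):
        raise IndexError("target_idx must point at a node in the linearized AXTree")
    if w < 0:
        raise ValueError("w must be non-negative")

    # Clamp the window to the tree boundaries so targets near the start or
    # end of the linearization still yield a valid (shorter) slice.
    lo = max(0, target_idx - w)
    hi = min(len(nodes), target_idx + w + 1)
    return nodes[lo:hi]
```

Because the window is anchored on the gold action target, a sketch like this only applies to offline data where that target is known for every step.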

Original abstract

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Weasel shows a practical greedy selection method that cuts training cost and lifts OOD scores on web agents, but the diversity term's contribution to generalization is not clearly isolated.

Weasel picks a fixed-budget subset of trajectory steps by balancing unary importance with pairwise diversity over states, websites, and interaction patterns, then adds target-centered AXTree pruning and swaps expert traces for model-generated rationales. The main result is that this produces better out-of-domain performance on WebArena, WorkArena, and MiniWob while delivering roughly 10x training speedups over full fine-tuning across a few models and two training datasets.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Weasel, a trajectory selection method for offline training of web agents. It selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solved via a greedy algorithm. Additional components include target-centered AXTree pruning and replacement of expert traces with model-generated rationales for style consistency. Experiments on AgentTrek and NNetNav training data, evaluated on WebArena, WorkArena, and MiniWob with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models, report improved out-of-domain performance together with 9.7-12.5× training speedups relative to standard fine-tuning. Code is released at the cited GitHub repository.

Significance. If the reported gains prove robust, the approach could meaningfully advance efficient offline training of generalizable web agents by addressing redundancy and noise in trajectory data. The public code release supports reproducibility and is a clear strength.

major comments (2)

[Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.
[Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.

minor comments (1)

The abstract states a selection budget but does not report its concrete value or sensitivity analysis; adding this would improve clarity without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

Referee: [Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.

Authors: We agree that error bars, precise baseline details, and an isolating ablation are necessary to attribute gains clearly. In the revised manuscript we will add error bars to all reported OOD and speedup results, expand the experimental section with exact baseline implementation details (including training hyperparameters, data preprocessing, and model versions), and insert a dedicated ablation that holds AXTree pruning and rationale replacement fixed while varying only the selection objective (full importance-diversity vs. importance-only vs. random). These changes will isolate the contribution of the core selection procedure.

Revision: yes

Referee: [Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.

Authors: We acknowledge that the current manuscript does not contain an explicit ablation removing the diversity term or a correlation analysis linking diversity metrics to OOD gains. To address this directly, the revision will add (i) an ablation that removes the pairwise diversity component while retaining importance scoring, AXTree pruning, and rationale replacement, (ii) a random-selection baseline matched for budget, and (iii) a supplementary analysis correlating per-trajectory diversity scores with observed OOD performance deltas across the three evaluation suites. While we continue to hold that the multi-aspect diversity objective (states, websites, interaction patterns) is motivated by the goal of broader coverage, the requested ablations will provide the empirical evidence needed to substantiate its role in OOD transfer.

Revision: yes

Circularity Check

0 steps flagged

Empirical selection procedure with no definitional circularity

The paper describes Weasel as a practical trajectory selection algorithm that optimizes a unary-importance-plus-pairwise-diversity objective via a stated greedy procedure, followed by AXTree pruning and rationale replacement. Reported OOD gains and 9.7-12.5× speedups are obtained from direct experimental comparisons on AgentTrek, NNetNav, WebArena, WorkArena, and MiniWob with multiple base models; these outcomes are not algebraically forced by any fitted parameter, self-referential normalization, or uniqueness theorem internal to the paper. The method is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only view limits visibility; inferred elements are the fixed selection budget (hyperparameter) and the claim that the greedy algorithm sufficiently approximates the combinatorial objective. No new physical entities or ad-hoc constants are introduced.

free parameters (1)

selection budget
Fixed number of trajectory steps retained; chosen to control training cost.

axioms (1)

Greedy algorithm yields a good approximation to the joint importance-diversity objective (domain assumption)
Invoked to make selection tractable for large trajectory pools.

pith-pipeline@v0.9.0 · 5792 in / 1342 out tokens · 66143 ms · 2026-05-21T08:04:22.068509+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We formulate a fixed-budget subset selection problem with a quadratic objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm.

IndisputableMonolith/Foundation/BranchSelection.lean · RCLCombiner_isCoupling_iff · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $D(i, j) = \max(\delta(s_i, s_j), \delta(y_i, y_j))$ with $\delta = 1 - \text{BERTScore}$
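The pairwise distance quoted in the passage above can be sketched as follows. This is a hedged illustration: it assumes $s_i$ is the state text and $y_i$ the output text of step $i$ (the excerpt does not define them), and it swaps BERTScore for a simple token-level Jaccard similarity so the example runs without any model. The names `Step`, `jaccard_similarity`, and `pairwise_distance` are hypothetical.

```python
from typing import Callable, Tuple

# Hypothetical step representation: (state text s_i, output text y_i).
Step = Tuple[str, str]


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT BERTScore, used only for illustration."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def pairwise_distance(
    x: Step,
    y: Step,
    sim: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """D(i, j) = max(delta(s_i, s_j), delta(y_i, y_j)) with delta = 1 - sim."""
    delta_state = 1.0 - sim(x[0], y[0])
    delta_output = 1.0 - sim(x[1], y[1])
    # Taking the max means two steps count as distant if EITHER their
    # states or their outputs differ strongly.
    return max(delta_state, delta_output)
```

Passing a different `sim` callable (for instance one backed by an embedding model) changes the similarity measure without touching the max-of-two-deltas structure.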

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.