ClawGym: A Scalable Framework for Building Effective Claw Agents

Bryan Dai; Chuan Hao; Daixuan Cheng; Fei Bai; Feng Chang; Huatong Song; Jian Yang; Ji-Rong Wen; Ran Tao; Renyuan Li

arxiv: 2604.26904 · v3 · pith:2WCQIODVnew · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

Pith reviewed 2026-05-20 23:53 UTC · model grok-4.3

keywords: claw-style, clawgym, framework, scalable, agent, construct, development, environments

The pith

ClawGym provides synthetic data generation, SFT and RL training, and a benchmark for developing Claw-style agents that operate over local files and tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Claw-style environments let AI agents carry out tasks that need several steps, such as working with files on a computer, calling tools, and keeping track of changes in a workspace over time. Building good agents for these settings has been difficult because good training data and evaluation methods have been lacking. ClawGym offers a complete system for creating such agents. It starts by building a dataset of 13,500 tasks drawn from different user personas and the skills needed for file and tool operations, using mock workspaces and checks to confirm success. Models are then trained on examples of agent actions collected from runs in these setups, first with supervised fine-tuning and then with reinforcement learning across many parallel simulations. A test set of 200 cases, filtered automatically and reviewed by humans and LLMs, helps measure performance. The authors have released the code and resources online.

Core claim

we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms.
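To make the pairing of a task, a mock workspace, and a hybrid verifier more concrete, here is a minimal Python sketch. Everything in it is hypothetical and illustrative: the `Task` fields, `materialize`, `hybrid_verify`, the example task, and the stand-in judge are assumptions, since the passage above does not describe the actual ClawGym-SynData schema or its verifiers.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Tuple


@dataclass
class Task:
    """Hypothetical task record; the real ClawGym-SynData schema is not given here."""
    persona: str                        # who is asking (persona-driven intent)
    instruction: str                    # what the agent is asked to do
    workspace_files: Dict[str, str]     # relative path -> initial file content
    code_check: Callable[[Path], bool]  # deterministic, code-based verifier
    rubric: str                         # text a model-based judge would score against


def materialize(task: Task, root: Path) -> None:
    """Write the mock workspace for a task under `root`."""
    for rel_path, content in task.workspace_files.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")


def hybrid_verify(task: Task, root: Path, judge: Callable[[str, Path], bool]) -> bool:
    """Assumed hybrid rule: pass only if the code check AND the judge both pass."""
    return task.code_check(root) and judge(task.rubric, root)


def check_total(root: Path) -> bool:
    """Code-based check: the report must exist and contain the expected total."""
    report = root / "reports" / "total.txt"
    return report.exists() and report.read_text(encoding="utf-8").strip() == "42"


example_task = Task(
    persona="freelance accountant",
    instruction="Sum the amounts in data/expenses.csv and write the total to reports/total.txt",
    workspace_files={"data/expenses.csv": "item,amount\npaper,12\nink,30\n"},
    code_check=check_total,
    rubric="The report contains only the numeric total.",
)


def always_pass_judge(rubric: str, root: Path) -> bool:
    """Placeholder for an LLM-based judge; a real one would inspect the workspace."""
    return True


def run_demo() -> Tuple[bool, bool]:
    """Verify the workspace before and after a simulated agent action."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        materialize(example_task, root)
        before = hybrid_verify(example_task, root, always_pass_judge)
        # Simulate the agent completing the task by writing the report.
        (root / "reports").mkdir()
        (root / "reports" / "total.txt").write_text("42\n", encoding="utf-8")
        after = hybrid_verify(example_task, root, always_pass_judge)
    return before, after
```

In this sketch the code-based check gates the judge, so a cheap deterministic failure never reaches the more expensive model-based review. That ordering is a design choice made for the illustration, not something the source states.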

Load-bearing premise

The synthesized tasks from persona-driven intents combined with hybrid verification mechanisms produce training data and benchmarks that lead to agents capable of effective generalization in real Claw-style environments.

Figures

Figures reproduced from arXiv: 2604.26904 by Bryan Dai, Chuan Hao, Daixuan Cheng, Fei Bai, Feng Chang, Huatong Song, Jian Yang, Ji-Rong Wen, Ran Tao, Renyuan Li, Shuang Sun, Wayne Xin Zhao, Yike Yang, Yuan Wei.

**Figure 2.** Task distribution of persona-driven synthesis. Figure (a) shows the distribution of user-facing ...

**Figure 3.** RL training curves on ClawGym-Bench. Scores are computed using only code-based verifiers ...

**Figure 4.** Effect of training trajectory scale on SFT Model using ClawGym-SynData.

**Figure 5.** Effect of trajectory filtering reward threshold on SFT Model ClawGym-SynData.

**Figure 6.** The stronger trajectory builds a computation-and-verification pipeline, while the weaker one ...

**Figure 7.** A representative case of error recovery in long-horizon execution. The stronger trajectory ...

**Figure 8.** A representative case of fine-grained requirement satisfaction. The weaker trajectory produces ...
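The Figure 5 caption refers to a reward threshold for trajectory filtering. One plausible reading, sketched below as an assumption rather than the paper's procedure, is that only rollouts whose verifier reward meets a threshold are kept as SFT examples. The `Trajectory` fields, the `[0, 1]` reward scale, and the `0.8` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical rollout record; field names are illustrative only."""
    task_id: str
    reward: float      # verifier score, assumed to lie in [0, 1]
    steps: List[str]   # agent actions, abbreviated to strings here


def filter_trajectories(trajectories: List[Trajectory], threshold: float) -> List[Trajectory]:
    """Keep only trajectories whose reward is at least `threshold`."""
    return [t for t in trajectories if t.reward >= threshold]


rollouts = [
    Trajectory("t1", 1.0, ["read file", "write report"]),
    Trajectory("t2", 0.5, ["read file"]),
    Trajectory("t3", 0.8, ["list dir", "write report"]),
]

# Assumed threshold; raising it trades data volume for trajectory quality.
kept = filter_trajectories(rollouts, threshold=0.8)
kept_ids = [t.task_id for t in kept]
```

With these made-up rollouts, `t2` is dropped and the other two are retained, which shows the trade-off the figure title points at: a stricter threshold yields fewer but higher-scoring training trajectories.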

read the original abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.
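The abstract mentions a lightweight pipeline that parallelizes rollouts across per-task sandboxes. A minimal Python sketch of that general pattern follows, assuming each sandbox is just a temporary directory and the "agent" is a trivial stand-in that uppercases a file. The names `rollout_in_sandbox` and `parallel_rollouts`, the task dictionary layout, and the reward rule are hypothetical; the source does not specify how ClawGym implements sandboxes or scheduling.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List


def rollout_in_sandbox(task: Dict[str, str]) -> Dict[str, object]:
    """Run one task in its own temporary directory so state never leaks across tasks."""
    with tempfile.TemporaryDirectory(prefix=f"sandbox_{task['id']}_") as tmp:
        root = Path(tmp)
        (root / "input.txt").write_text(task["payload"], encoding="utf-8")

        # Stand-in for the agent policy: read the input and write an uppercased copy.
        text = (root / "input.txt").read_text(encoding="utf-8")
        (root / "output.txt").write_text(text.upper(), encoding="utf-8")

        # Code-based reward: 1.0 if the produced file matches the expected content.
        produced = (root / "output.txt").read_text(encoding="utf-8")
        reward = 1.0 if produced == task["expected"] else 0.0
    return {"task_id": task["id"], "reward": reward}


def parallel_rollouts(tasks: List[Dict[str, str]], max_workers: int = 4) -> List[Dict[str, object]]:
    """Run rollouts concurrently; `map` returns results in the same order as `tasks`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rollout_in_sandbox, tasks))


tasks = [
    {"id": "t1", "payload": "hello", "expected": "HELLO"},
    {"id": "t2", "payload": "claw", "expected": "claws"},  # deliberately unsatisfiable
]
results = parallel_rollouts(tasks)
rewards = [r["reward"] for r in results]
```

Threads are used here only because the stand-in work is file I/O. A real pipeline would more likely isolate each rollout in a separate process or container, which this sketch does not attempt.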

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClawGym supplies a concrete pipeline for synthetic task generation and agent training in file-and-tool environments, but the evidence for real-world generalization is missing.

This paper describes ClawGym as a framework for the full cycle of building agents that handle multi-step file, tool, and workspace tasks. The key point is that it supplies a scalable way to create training data and train models, yet the actual performance gains in real settings are not clearly shown.

They create ClawGym-SynData with 13.5K filtered tasks drawn from persona-driven intents and skill-grounded operations. These come with realistic mock workspaces and hybrid verification to ensure quality. Training happens first through supervised fine-tuning on black-box rollout trajectories, then reinforcement learning in a setup that runs rollouts in parallel per-task sandboxes. Evaluation uses ClawGym-Bench, a set of 200 instances filtered automatically and reviewed by humans and LLMs. The GitHub release makes the resources available for others.

The work stands out for integrating these pieces into one system tailored to Claw-style environments. For developers focused on agent training pipelines, the details on how they generate diverse tasks and set up the RL sandbox could be directly useful.

The main soft spot is the assumption that this synthetic setup produces agents that generalize effectively outside the mocks. The description does not include transfer validation experiments, ablations on mock versus real complexity, or breakdowns of failures related to persistent state or tool variations. If those factors are not well covered in the mocks, the agents could do fine on the benchmark but struggle in actual use. This makes the claim of supporting effective agent development rest on an untested step.

Readers working on practical agent systems or synthetic data methods for tool use would get value from the concrete pipeline. It is coherent enough and provides enough open material to warrant a serious referee, though reviewers will probably ask for more on the validation side. I would recommend putting it through peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawGym, a scalable framework supporting the full lifecycle of Claw-style personal agent development in environments involving multi-step workflows over local files, tools, and persistent states. It constructs ClawGym-SynData, a dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations using realistic mock workspaces and hybrid verification mechanisms. ClawGym-Agents are trained via supervised fine-tuning on black-box rollout trajectories, with exploration of a lightweight reinforcement learning pipeline that parallelizes rollouts across per-task sandboxes. Evaluation is supported by ClawGym-Bench, a 200-instance benchmark calibrated via automated filtering and human-LLM review. Resources are released at a GitHub repository.

Significance. If the synthesized data and training pipeline produce agents that generalize effectively, the work would offer a practical contribution by addressing data scarcity and evaluation bottlenecks for agents in stateful, tool-rich personal computing environments. The open release of the dataset, benchmark, and framework could enable reproducible progress in this area.

major comments (2)

[Abstract] Abstract: The claim of training 'a family of capable Claw-style models' through SFT on black-box rollouts lacks any reported quantitative metrics (e.g., success rates, error analysis, or baseline comparisons on ClawGym-Bench). This is load-bearing for the central claim that the framework supports building effective agents.
[Abstract] Abstract: Generalization from the 13.5K synthetic tasks in mock workspaces to real persistent Claw environments is asserted without reported transfer validation, ablation on mock vs. real complexity, or analysis of failure modes such as persistent state handling, file permissions, or tool variances.

minor comments (2)

[Abstract] Abstract: The hybrid verification mechanisms are mentioned but not described in sufficient detail to assess how they ensure reliable task filtering and labeling.
[Abstract] Abstract: The scale and implementation details of the parallelized RL rollout pipeline across sandboxes would benefit from additional clarification for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, agreeing where revisions are needed to better support our claims while clarifying the scope of the current work.

Referee: [Abstract] Abstract: The claim of training 'a family of capable Claw-style models' through SFT on black-box rollouts lacks any reported quantitative metrics (e.g., success rates, error analysis, or baseline comparisons on ClawGym-Bench). This is load-bearing for the central claim that the framework supports building effective agents.

Authors: We agree that the abstract would benefit from explicit quantitative support for the claim of capable models. The manuscript presents the training pipeline and benchmark construction in detail, but to directly address this concern we will revise the abstract to include key success rates on ClawGym-Bench, a summary of error analysis, and baseline comparisons from the evaluation section.

Revision: yes

Referee: [Abstract] Abstract: Generalization from the 13.5K synthetic tasks in mock workspaces to real persistent Claw environments is asserted without reported transfer validation, ablation on mock vs. real complexity, or analysis of failure modes such as persistent state handling, file permissions, or tool variances.

Authors: We acknowledge that direct transfer validation to real environments is not reported. The mock workspaces were designed with realistic persistent states and hybrid verification to approximate real conditions, but we will add a new limitations subsection that discusses generalization gaps, provides qualitative analysis of failure modes including state handling and tool variances, and frames this as an important direction for future work.

Revision: yes

Circularity Check

0 steps flagged

No circularity: framework construction and data synthesis are independent of outputs

The paper presents ClawGym as a constructed framework for synthesizing 13.5K tasks from persona-driven intents and skill-grounded operations, paired with mock workspaces and hybrid verification, followed by SFT on black-box rollouts and a 200-instance benchmark. No equations, fitted parameters, or predictions are described that reduce by construction to the paper's own inputs or outputs. No self-citations are invoked as load-bearing uniqueness theorems, and the central claims rest on the novelty of the synthesis pipeline and training process rather than renaming known results or smuggling ansatzes. The derivation chain is self-contained against external benchmarks of agent development.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified quality of the 13.5K synthetic tasks and the hybrid verification; no explicit free parameters, new physical entities, or formal axioms are stated in the abstract.

axioms (1)

Domain assumption: Persona-driven intents and skill-grounded operations generate diverse, realistic tasks that support effective agent training. Invoked to justify construction of ClawGym-SynData.

pith-pipeline@v0.9.0 · 5758 in / 1291 out tokens · 53053 ms · 2026-05-20T23:53:24.152351+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 2 internal anchors

[1] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. Gdpva...

[2] OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. OpenClaw-RL: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026.
