pith. sign in

arxiv: 2605.15040 · v2 · pith:B7XP6NKOnew · submitted 2026-05-14 · 💻 cs.AI · cs.CL

Orchard: An Open-Source Agentic Modeling Framework

Pith reviewed 2026-05-22 09:46 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords agentic modelingopen-source frameworkenvironment servicetrajectory distillationSWE-benchcomputer use agentreinforcement learningmulti-turn interaction
0
0 comments X

The pith

A lightweight environment service lets open models learn agent behaviors from distilled trajectories of larger systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Orchard as an open-source framework for turning language models into agents that plan, reason, and interact with environments over multiple turns. Its central piece is a reusable environment service that handles sandbox management for different tasks and training stages. The authors apply this service to three domains, distilling trajectories from larger models and using targeted supervised fine-tuning plus reinforcement learning to reach leading open-source results on coding benchmarks, web navigation tasks, and personal assistant evaluations. If the approach holds, open research could advance capable agents without depending on closed infrastructure or large amounts of new human data.

Core claim

The authors claim that Orchard, built on Orchard Env, supplies reusable primitives for environment management that enable scalable agent training. In the coding recipe, 107K trajectories distilled from larger models, credit-assignment supervised fine-tuning, and Balanced Adaptive Rollout produce 64.3 percent success after SFT and 67.5 percent after SFT plus RL on SWE-bench Verified, a new high for open-source models of comparable size. Parallel recipes yield 74.1 percent, 67.0 percent, and 64.0 percent on three GUI benchmarks with a 4B vision-language model and 59.6 percent pass@3 on an assistant evaluation with only 0.2K synthetic tasks.

What carries the argument

Orchard Env, a lightweight environment service that supplies reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages.

If this is right

  • Open-source models of moderate size can reach competitive agent performance on established benchmarks when supplied with distilled trajectories and the described training steps.
  • The same environment layer supports effective training recipes for coding agents, vision-language GUI agents, and personal assistant agents.
  • Reusable data, training pipelines, and evaluations become feasible across agent domains without proprietary codebases.
  • Credit assignment during supervised fine-tuning allows learning from productive segments even in trajectories that do not fully solve the task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The environment layer could be adapted with modest effort to new domains such as scientific tool use or multi-agent coordination.
  • If the distillation and rollout methods generalize, dependence on closed models for high-quality training signals could decrease over time.
  • The framework invites tests on whether the same lightweight primitives scale to longer-horizon or noisier real-world environments.

Load-bearing premise

Trajectories distilled from much larger proprietary models combined with credit-assignment SFT and Balanced Adaptive Rollout will reliably transfer productive behavior to smaller open models without requiring extensive additional human data or environment-specific tuning.

What would settle it

Removing the credit-assignment step or the Balanced Adaptive Rollout from the coding recipe and checking whether performance on SWE-bench Verified falls materially below 64.3 percent.

Figures

Figures reproduced from arXiv: 2605.15040 by Alessandro Sordoni, Baolin Peng, Hao Cheng, Jianfeng Gao, Pengcheng He, Qianhui Wu, Rui Yang, Tao Ge, Tong Zhang, Wenlin Yao, Xiao Yu, Xingdi Yuan, Yelong Shen, Zhou Yu.

Figure 1
Figure 1. Figure 1: Performance comparison. Left: Orchard-SWE (30B) reaches 67.5% on SWE￾bench Verified, approaching frontier MoE systems 10–30× larger. Right: Orchard-GUI (4B) achieves 68.4% average success across WebVoyager, Online-Mind2web, and DeepShop, making it the strongest open-source GUI agent while staying on par with proprietary systems from OpenAI and Google. 1 Introduction Large language model (LLM) agents that i… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Orchard framework. Orchard Env (center) is a thin, Kubernetes [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Orchard Env architecture. A Python client SDK (synchronous or asynchronous) [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Task-filtering pipeline. Starting from 292,092 raw tasks, we sequentially remove evaluation-benchmark overlap, child tasks, WebVoyager intents, long-tail sites, and near￾duplicate intents (semantic similarity ≥ 0.99 under Qwen3-Embedding-8B), yielding a final pool of 15,601 deduplicated seed tasks on popular websites. 0 1000 2000 3000 4000 5000 Tasks Science & Research Misc. Lifestyle & Leisure Career & Ed… view at source ↗
Figure 5
Figure 5. Figure 5: Composition of the filtered seed task pool. Left: task share by top-level domain (6 categories spanning 15,601 tasks). Right: the seed task pool covers 57.0% of SimilarWeb Top-100 Most Visited and 85.0% of MOZ Top-500 Most Popular websites, with 13.0% and 48.5% of tasks landing on those respective lists. 68.4% of tasks have at least one passing rollout, 26.3% pass on all four rollouts, and the remaining 31… view at source ↗
Figure 6
Figure 6. Figure 6: Per-website and per-task success outcomes. Left: share of tasks by outcome across the 4 rollouts – 26.3% all-pass, 42.1% mixed (1–3 of 4), and 31.6% all-fail. Of the all-fail tasks, 41.1% (2,026 tasks; 13.0% of the full 15,601-task pool) are captcha-blocked on every one of their four rollouts. Right: per-website teacher success rate for the top-30 websites by rollout count, split into two columns; bar leng… view at source ↗
Figure 7
Figure 7. Figure 7: Orchard-GUI RL Training and Evaluation Curve. The red curves denote RL train￾ing starting from our SFT checkpoint, while the blue curves denote RL training initialized from the base model. Compared with the base model initialization, the SFT-initialized model achieves consistently higher evaluation success rates and more stable reward improvements throughout training. soning trace followed by one or more <… view at source ↗
Figure 8
Figure 8. Figure 8: Orchard-Claw RL training curves. Left: train and validation success rate over RL steps. Right: train and validation episode length (number of agent turns per rollout). Validation tasks are sampled from the ClawEval benchmark. Both metrics rise steadily over the course of training, indicating that the agent learns to solve more tasks while also engaging in longer multi-turn interactions [PITH_FULL_IMAGE:fi… view at source ↗
read the original abstract

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Orchard, an open-source framework for agentic modeling of LLMs. Its core is Orchard Env, a lightweight service for sandbox lifecycle management. The paper describes three domain-specific recipes built on this layer: Orchard-SWE distills 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, applies credit-assignment SFT and Balanced Adaptive Rollout to a Qwen3-30B base, and reports 64.3% (SFT) and 67.5% (SFT+RL) on SWE-bench Verified; Orchard-GUI trains a 4B vision-language agent on 0.4K distilled trajectories plus 2.2K open-ended tasks, achieving 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop; Orchard-Claw uses 0.2K synthetic tasks to reach 59.6% pass@3 on Claw-Eval (73.9% with stronger harness). The central claim is that the harness-agnostic environment layer enables reusable agentic data, training, and evaluations that yield new open-source SOTA results.

Significance. If the performance numbers and attribution to the proposed techniques hold, the work would be a useful contribution by releasing an open-source environment primitive and training recipes that lower the barrier to reproducible agent research across coding, GUI, and assistant domains. The explicit use of public benchmarks and distillation from large models provides a concrete starting point for follow-on work, though the absence of ablations limits immediate impact.

major comments (3)
  1. [Abstract] Abstract: the reported 3.2-point lift from SFT to SFT+RL on SWE-bench Verified (64.3% to 67.5%) is presented without error bars, run-to-run variance, or statistical tests; this is load-bearing for the SOTA claim among open-source models of comparable size.
  2. [Abstract] Abstract (Orchard-SWE paragraph): credit-assignment SFT is described only as learning from 'productive segments of unresolved trajectories' and Balanced Adaptive Rollout is named without pseudocode, scoring rule, or ablation; without these details the performance numbers cannot be attributed to the new techniques rather than teacher-model scale or unreported tuning.
  3. [Abstract] Abstract (Orchard-GUI paragraph): success rates of 74.1%, 67.0%, and 64.0% are given for a 4B model trained on 0.4K distilled trajectories, yet no baseline comparisons to prior open-source 4B-scale agents or ablation on the 2.2K open-ended tasks are supplied; this weakens the claim of being the strongest open-source model while remaining competitive with proprietary systems.
minor comments (2)
  1. The manuscript would benefit from a single figure or table summarizing the three recipes, their data sources, and key hyperparameters to improve readability across domains.
  2. Notation for 'Orchard Env' and the harness-agnostic claim should be introduced with a short pseudocode block or architecture diagram in the framework section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 3.2-point lift from SFT to SFT+RL on SWE-bench Verified (64.3% to 67.5%) is presented without error bars, run-to-run variance, or statistical tests; this is load-bearing for the SOTA claim among open-source models of comparable size.

    Authors: We agree that error bars, variance estimates, and statistical tests would strengthen the presentation of the 3.2-point improvement. In the revised manuscript we will report evaluation variance from multiple runs on the final checkpoint and add a brief discussion of run-to-run stability in the main text and abstract. We will also note the computational constraints that limited the number of independent training runs. revision: yes

  2. Referee: [Abstract] Abstract (Orchard-SWE paragraph): credit-assignment SFT is described only as learning from 'productive segments of unresolved trajectories' and Balanced Adaptive Rollout is named without pseudocode, scoring rule, or ablation; without these details the performance numbers cannot be attributed to the new techniques rather than teacher-model scale or unreported tuning.

    Authors: We accept that the abstract description is too terse to support attribution. The full manuscript already contains expanded descriptions of both techniques. In the revision we will move the pseudocode and scoring rule for Balanced Adaptive Rollout to a new appendix section, add an explicit ablation table isolating credit-assignment SFT and the adaptive rollout, and update the abstract to reference these additions. This will allow readers to distinguish the contribution of our methods from teacher-model scale. revision: yes

  3. Referee: [Abstract] Abstract (Orchard-GUI paragraph): success rates of 74.1%, 67.0%, and 64.0% are given for a 4B model trained on 0.4K distilled trajectories, yet no baseline comparisons to prior open-source 4B-scale agents or ablation on the 2.2K open-ended tasks are supplied; this weakens the claim of being the strongest open-source model while remaining competitive with proprietary systems.

    Authors: We will add a table of direct comparisons against previously published open-source 4B-scale GUI agents on the same three benchmarks. For the contribution of the 2.2K open-ended tasks we will include an ablation that reports performance when training on the 0.4K distilled trajectories alone. These results will appear in the main results section and be summarized in the abstract. revision: partial

Circularity Check

0 steps flagged

No significant circularity in claimed results or framework

full rationale

The paper reports empirical performance on external public benchmarks (SWE-bench Verified, WebVoyager, etc.) after distilling trajectories from separately trained larger models and applying described training recipes. No equations, self-definitions, or fitted parameters are presented that reduce the reported success rates or state-of-the-art claims to quantities defined or fitted inside the same system by construction. The central results remain anchored to independent external evaluations rather than internal consistency checks or self-citations that bear the load of the performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The claims rest on the domain assumption that large-model trajectories can be productively distilled and refined with the stated SFT and RL techniques, plus the introduction of a new environment abstraction whose utility is demonstrated only through the reported experiments.

axioms (1)
  • domain assumption Trajectories from much larger models can be distilled and refined via credit-assignment SFT and Balanced Adaptive Rollout to produce strong performance in smaller open models.
    Invoked for all three recipes, especially Orchard-SWE.
invented entities (1)
  • Orchard Env no independent evidence
    purpose: Lightweight reusable service for sandbox lifecycle management across agent tasks and pipeline stages.
    Presented as the foundational layer enabling the three recipes; no external validation supplied in the abstract.

pith-pipeline@v0.9.0 · 5947 in / 1511 out tokens · 50547 ms · 2026-05-22T09:46:16.795737+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1]

    URLhttps://arxiv.org/abs/2601.11868. MiniMax. MiniMax M2.5: Built for real-world productivity. https://www.minimax.io/ news/minimax-m25, February 2026. HuggingFace: https://huggingface.co/MiniMaxAI/ MiniMax-M2.5. Modal Labs. Modal: High-performance AI infrastructure.https://modal.com, 2024. Moonshot AI. Kimi Code CLI. URL https://github.com/MoonshotAI/kim...

  2. [2]

    type": "function

    ISSN 2835-8856. URLhttps://openreview.net/forum?id=xChvYjvXTp. Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, and Yi Dong. ProRL Agent: Rollout-as-a-service for RL training of multi-turn LLM agents, 2026a. URL https: //arxiv.org/abs/2603.18815. Lei Zhang, ...

  3. [3]

    This helps you track progress and avoid repeating actions

    **Action History**: Your interaction history showing all previous actions taken to accomplish the current task. This helps you track progress and avoid repeating actions

  4. [4]

    This is your main goal

    **User Request**: The primary objective that clearly specifies the task you need to complete. This is your main goal

  5. [5]

    click -> write -> press Enter

    **Observation**: Current state information about the web page, including: - **Tab Info**: The currently active tab index and a list of all open tabs with their index, URL, and page title - **Screenshot**: Visual representation of the current page state - **A11y Tree** *(optional)*: Accessibility tree containing interactive elements with their IDs, types, ...

  6. [6]

    Continue shopping

    Click the "Continue shopping" button to get past this verification page

  7. [7]

    dog bed washable 30 inches

    Once on the main Amazon page, search for "dog bed washable 30 inches"

  8. [9]

    Continue shopping

    Review results and find a suitable product The next logical action is to click the "Continue shopping" button to proceed with the task. This is necessary before I can access the search functionality of Amazon. </think> <tool_call> {"name": "click", "arguments": {"point_2d": [501, 213]}} </tool_call> <|im_end|> <|im_start|>user <tool_response> Succeed:`cli...

  9. [12]

    Search Amazon

    Review results and find a suitable product The next logical step is to click on the search bar so I can type my search query. The search bar is clearly visible at the top of the page with the placeholder text "Search Amazon". Once I click on it, I'll be able to type my search terms for dog beds that are washable and at least 30 inches long. I should click...

  10. [13]

    dog bed washable 30 inches

    Search for "dog bed washable 30 inches" in the Amazon search bar

  11. [15]

    dog bed washable 30 inches

    Review results and find a suitable product Now that the search bar is focused, the next logical step is to type my search query. I should enter "dog bed washable 30 inches" to find products that match all the criteria specified in the user request. This search term includes all the key requirements: - It's for a dog bed - It specifies "washable" as a requ...

  12. [16]

    dog bed washable 30 inches

    Execute the search for "dog bed washable 30 inches"

  13. [17]

    Apply filters for washable beds with length >= 30 inches

  14. [18]

    name": "click

    Review results and find a suitable product Now that I've entered the search query, the next logical step is to submit the search. I can do this by clicking the search button (the magnifying glass icon) next to the search bar. This will take me to the search results page where I can see all the dog beds that match my criteria. The search button is visible ...

  15. [19]

    Washable

    Apply the "Washable" filter by clicking the checkbox in the left sidebar

  16. [20]

    After filtering, review the results to confirm they meet the length requirement (at least 30 inches)

  17. [21]

    Washable

    Select a suitable product that meets both criteria The next logical action is to click the "Washable" checkbox under Special Features to filter the results to only show washable dog beds. This will help narrow down the options and make it easier to find products that meet the user's requirements. </think> <tool_call> {"name": "click", "arguments": {"point...

  18. [22]

    Is washable (confirmed by the filter)

  19. [23]

    name": "click

    Has a length of 35 inches (exceeding the 30-inch requirement) I'll click on this product to verify the details and see if it's a suitable match for the user's request. </think> <tool_call> {"name": "click", "arguments": {"point_2d": [496, 514]}} </tool_call> <|im_end|> <|im_start|>user <tool_response> Succeed:`click`on <span> "EHEYCIGA Washable Large Dog ...

  20. [24]

    Washable Large Dog Bed

    It's washable - confirmed by the product name and description ("Washable Large Dog Bed")

  21. [25]

    35x23 Inch

    It has a length of 35 inches - explicitly stated in the product title as "35x23 Inch" and confirmed in the size specifications at the bottom of the page showing "35" x 23" x 3.5"" The product details confirm it's designed for dogs up to 45 lbs and is labeled as "Amazon's Choice" with good reviews (4.7 stars from 381 reviews). It's also marked as having su...

  22. [26]

    Remove common evaluation benchmarks.We strip out splits that overlap with our held- out benchmarks (e.g., Online-Mind2Web (Deng et al., 2023) and DeepShop (Lyu et al., 2025)) to prevent train/test contamination, retaining only the two comple- mentary PAE-WebVoyager (Zhou et al., 2025) and InSTA-v3 (Trabucco et al., 2025) splits (-13,840, 4.7% → 278,252). ...

  23. [27]

    Since child tasks share substantial structure with their parents, we retain only the parents to avoid intra-family redundancy (-23,437, 8.4% →254,815)

    Keep parent tasks only.WebGym additionally provides child tasks decomposed from each parent intent. Since child tasks share substantial structure with their parents, we retain only the parents to avoid intra-family redundancy (-23,437, 8.4% →254,815)

  24. [28]

    Exclude WebVoyager tasks.We further drop any task whose intent appears in the original WebVoyager benchmark, eliminating residual contamination at the prompt level (-411, 0.2%→254,404)

  25. [29]

    Restrict to popular websites.Long-tail websites are noisier (more captchas, anti-bot blocks, broken pages) and less representative of realistic browsing. We keep only tasks whose target site falls within the SimilarWeb Top-100 list and the MOZ Top 500 Most Popular Websites, and where the same site has at least two tasks, ensuring sufficient per-site cover...

  26. [30]

    We embed each task intent with Qwen/Qwen3-Embedding-8B and greed- ily remove tasks whose cosine similarity to a previously kept task exceeds 0.99 (-124,454, 88.9%→15,601)

    Semantic deduplication.The remaining pool is dominated by near-duplicate in- tents (e.g., paraphrases of the same shopping or search query across thousands of products). We embed each task intent with Qwen/Qwen3-Embedding-8B and greed- ily remove tasks whose cosine similarity to a previously kept task exceeds 0.99 (-124,454, 88.9%→15,601). The final filte...