Orchard: An Open-Source Agentic Modeling Framework
Pith reviewed 2026-05-22 09:46 UTC · model grok-4.3
The pith
A lightweight environment service lets open models learn agent behaviors from distilled trajectories of larger systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that Orchard, built on Orchard Env, supplies reusable primitives for environment management that enable scalable agent training. In the coding recipe, 107K trajectories distilled from larger models, credit-assignment supervised fine-tuning, and Balanced Adaptive Rollout produce 64.3 percent success after SFT and 67.5 percent after SFT plus RL on SWE-bench Verified, a new high for open-source models of comparable size. Parallel recipes yield 74.1 percent, 67.0 percent, and 64.0 percent on three GUI benchmarks with a 4B vision-language model and 59.6 percent pass@3 on an assistant evaluation with only 0.2K synthetic tasks.
What carries the argument
Orchard Env, a lightweight environment service that supplies reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages.
If this is right
- Open-source models of moderate size can reach competitive agent performance on established benchmarks when supplied with distilled trajectories and the described training steps.
- The same environment layer supports effective training recipes for coding agents, vision-language GUI agents, and personal assistant agents.
- Reusable data, training pipelines, and evaluations become feasible across agent domains without proprietary codebases.
- Credit assignment during supervised fine-tuning allows learning from productive segments even in trajectories that do not fully solve the task.
Where Pith is reading between the lines
- The environment layer could be adapted with modest effort to new domains such as scientific tool use or multi-agent coordination.
- If the distillation and rollout methods generalize, dependence on closed models for high-quality training signals could decrease over time.
- The framework invites tests on whether the same lightweight primitives scale to longer-horizon or noisier real-world environments.
Load-bearing premise
Trajectories distilled from much larger proprietary models combined with credit-assignment SFT and Balanced Adaptive Rollout will reliably transfer productive behavior to smaller open models without requiring extensive additional human data or environment-specific tuning.
What would settle it
Removing the credit-assignment step or the Balanced Adaptive Rollout from the coding recipe and checking whether performance on SWE-bench Verified falls materially below 64.3 percent.
Figures
read the original abstract
Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Orchard, an open-source framework for agentic modeling of LLMs. Its core is Orchard Env, a lightweight service for sandbox lifecycle management. The paper describes three domain-specific recipes built on this layer: Orchard-SWE distills 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, applies credit-assignment SFT and Balanced Adaptive Rollout to a Qwen3-30B base, and reports 64.3% (SFT) and 67.5% (SFT+RL) on SWE-bench Verified; Orchard-GUI trains a 4B vision-language agent on 0.4K distilled trajectories plus 2.2K open-ended tasks, achieving 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop; Orchard-Claw uses 0.2K synthetic tasks to reach 59.6% pass@3 on Claw-Eval (73.9% with stronger harness). The central claim is that the harness-agnostic environment layer enables reusable agentic data, training, and evaluations that yield new open-source SOTA results.
Significance. If the performance numbers and attribution to the proposed techniques hold, the work would be a useful contribution by releasing an open-source environment primitive and training recipes that lower the barrier to reproducible agent research across coding, GUI, and assistant domains. The explicit use of public benchmarks and distillation from large models provides a concrete starting point for follow-on work, though the absence of ablations limits immediate impact.
major comments (3)
- [Abstract] Abstract: the reported 3.2-point lift from SFT to SFT+RL on SWE-bench Verified (64.3% to 67.5%) is presented without error bars, run-to-run variance, or statistical tests; this is load-bearing for the SOTA claim among open-source models of comparable size.
- [Abstract] Abstract (Orchard-SWE paragraph): credit-assignment SFT is described only as learning from 'productive segments of unresolved trajectories' and Balanced Adaptive Rollout is named without pseudocode, scoring rule, or ablation; without these details the performance numbers cannot be attributed to the new techniques rather than teacher-model scale or unreported tuning.
- [Abstract] Abstract (Orchard-GUI paragraph): success rates of 74.1%, 67.0%, and 64.0% are given for a 4B model trained on 0.4K distilled trajectories, yet no baseline comparisons to prior open-source 4B-scale agents or ablation on the 2.2K open-ended tasks are supplied; this weakens the claim of being the strongest open-source model while remaining competitive with proprietary systems.
minor comments (2)
- The manuscript would benefit from a single figure or table summarizing the three recipes, their data sources, and key hyperparameters to improve readability across domains.
- Notation for 'Orchard Env' and the harness-agnostic claim should be introduced with a short pseudocode block or architecture diagram in the framework section.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the reported 3.2-point lift from SFT to SFT+RL on SWE-bench Verified (64.3% to 67.5%) is presented without error bars, run-to-run variance, or statistical tests; this is load-bearing for the SOTA claim among open-source models of comparable size.
Authors: We agree that error bars, variance estimates, and statistical tests would strengthen the presentation of the 3.2-point improvement. In the revised manuscript we will report evaluation variance from multiple runs on the final checkpoint and add a brief discussion of run-to-run stability in the main text and abstract. We will also note the computational constraints that limited the number of independent training runs. revision: yes
-
Referee: [Abstract] Abstract (Orchard-SWE paragraph): credit-assignment SFT is described only as learning from 'productive segments of unresolved trajectories' and Balanced Adaptive Rollout is named without pseudocode, scoring rule, or ablation; without these details the performance numbers cannot be attributed to the new techniques rather than teacher-model scale or unreported tuning.
Authors: We accept that the abstract description is too terse to support attribution. The full manuscript already contains expanded descriptions of both techniques. In the revision we will move the pseudocode and scoring rule for Balanced Adaptive Rollout to a new appendix section, add an explicit ablation table isolating credit-assignment SFT and the adaptive rollout, and update the abstract to reference these additions. This will allow readers to distinguish the contribution of our methods from teacher-model scale. revision: yes
-
Referee: [Abstract] Abstract (Orchard-GUI paragraph): success rates of 74.1%, 67.0%, and 64.0% are given for a 4B model trained on 0.4K distilled trajectories, yet no baseline comparisons to prior open-source 4B-scale agents or ablation on the 2.2K open-ended tasks are supplied; this weakens the claim of being the strongest open-source model while remaining competitive with proprietary systems.
Authors: We will add a table of direct comparisons against previously published open-source 4B-scale GUI agents on the same three benchmarks. For the contribution of the 2.2K open-ended tasks we will include an ablation that reports performance when training on the 0.4K distilled trajectories alone. These results will appear in the main results section and be summarized in the abstract. revision: partial
Circularity Check
No significant circularity in claimed results or framework
full rationale
The paper reports empirical performance on external public benchmarks (SWE-bench Verified, WebVoyager, etc.) after distilling trajectories from separately trained larger models and applying described training recipes. No equations, self-definitions, or fitted parameters are presented that reduce the reported success rates or state-of-the-art claims to quantities defined or fitted inside the same system by construction. The central results remain anchored to independent external evaluations rather than internal consistency checks or self-citations that bear the load of the performance numbers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Trajectories from much larger models can be distilled and refined via credit-assignment SFT and Balanced Adaptive Rollout to produce strong performance in smaller open models.
invented entities (1)
-
Orchard Env
no independent evidence
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/2601.11868. MiniMax. MiniMax M2.5: Built for real-world productivity. https://www.minimax.io/ news/minimax-m25, February 2026. HuggingFace: https://huggingface.co/MiniMaxAI/ MiniMax-M2.5. Modal Labs. Modal: High-performance AI infrastructure.https://modal.com, 2024. Moonshot AI. Kimi Code CLI. URL https://github.com/MoonshotAI/kim...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3711896 2026
-
[2]
ISSN 2835-8856. URLhttps://openreview.net/forum?id=xChvYjvXTp. Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, and Yi Dong. ProRL Agent: Rollout-as-a-service for RL training of multi-turn LLM agents, 2026a. URL https: //arxiv.org/abs/2603.18815. Lei Zhang, ...
-
[3]
This helps you track progress and avoid repeating actions
**Action History**: Your interaction history showing all previous actions taken to accomplish the current task. This helps you track progress and avoid repeating actions
-
[4]
**User Request**: The primary objective that clearly specifies the task you need to complete. This is your main goal
-
[5]
**Observation**: Current state information about the web page, including: - **Tab Info**: The currently active tab index and a list of all open tabs with their index, URL, and page title - **Screenshot**: Visual representation of the current page state - **A11y Tree** *(optional)*: Accessibility tree containing interactive elements with their IDs, types, ...
- [6]
-
[7]
Once on the main Amazon page, search for "dog bed washable 30 inches"
-
[9]
Review results and find a suitable product The next logical action is to click the "Continue shopping" button to proceed with the task. This is necessary before I can access the search functionality of Amazon. </think> <tool_call> {"name": "click", "arguments": {"point_2d": [501, 213]}} </tool_call> <|im_end|> <|im_start|>user <tool_response> Succeed:`cli...
-
[12]
Review results and find a suitable product The next logical step is to click on the search bar so I can type my search query. The search bar is clearly visible at the top of the page with the placeholder text "Search Amazon". Once I click on it, I'll be able to type my search terms for dog beds that are washable and at least 30 inches long. I should click...
-
[13]
Search for "dog bed washable 30 inches" in the Amazon search bar
-
[15]
Review results and find a suitable product Now that the search bar is focused, the next logical step is to type my search query. I should enter "dog bed washable 30 inches" to find products that match all the criteria specified in the user request. This search term includes all the key requirements: - It's for a dog bed - It specifies "washable" as a requ...
- [16]
-
[17]
Apply filters for washable beds with length >= 30 inches
-
[18]
Review results and find a suitable product Now that I've entered the search query, the next logical step is to submit the search. I can do this by clicking the search button (the magnifying glass icon) next to the search bar. This will take me to the search results page where I can see all the dog beds that match my criteria. The search button is visible ...
- [19]
-
[20]
After filtering, review the results to confirm they meet the length requirement (at least 30 inches)
-
[21]
Select a suitable product that meets both criteria The next logical action is to click the "Washable" checkbox under Special Features to filter the results to only show washable dog beds. This will help narrow down the options and make it easier to find products that meet the user's requirements. </think> <tool_call> {"name": "click", "arguments": {"point...
-
[22]
Is washable (confirmed by the filter)
-
[23]
Has a length of 35 inches (exceeding the 30-inch requirement) I'll click on this product to verify the details and see if it's a suitable match for the user's request. </think> <tool_call> {"name": "click", "arguments": {"point_2d": [496, 514]}} </tool_call> <|im_end|> <|im_start|>user <tool_response> Succeed:`click`on <span> "EHEYCIGA Washable Large Dog ...
-
[24]
It's washable - confirmed by the product name and description ("Washable Large Dog Bed")
-
[25]
It has a length of 35 inches - explicitly stated in the product title as "35x23 Inch" and confirmed in the size specifications at the bottom of the page showing "35" x 23" x 3.5"" The product details confirm it's designed for dogs up to 45 lbs and is labeled as "Amazon's Choice" with good reviews (4.7 stars from 381 reviews). It's also marked as having su...
-
[26]
Remove common evaluation benchmarks.We strip out splits that overlap with our held- out benchmarks (e.g., Online-Mind2Web (Deng et al., 2023) and DeepShop (Lyu et al., 2025)) to prevent train/test contamination, retaining only the two comple- mentary PAE-WebVoyager (Zhou et al., 2025) and InSTA-v3 (Trabucco et al., 2025) splits (-13,840, 4.7% → 278,252). ...
work page 2023
-
[27]
Keep parent tasks only.WebGym additionally provides child tasks decomposed from each parent intent. Since child tasks share substantial structure with their parents, we retain only the parents to avoid intra-family redundancy (-23,437, 8.4% →254,815)
-
[28]
Exclude WebVoyager tasks.We further drop any task whose intent appears in the original WebVoyager benchmark, eliminating residual contamination at the prompt level (-411, 0.2%→254,404)
-
[29]
Restrict to popular websites.Long-tail websites are noisier (more captchas, anti-bot blocks, broken pages) and less representative of realistic browsing. We keep only tasks whose target site falls within the SimilarWeb Top-100 list and the MOZ Top 500 Most Popular Websites, and where the same site has at least two tasks, ensuring sufficient per-site cover...
-
[30]
Semantic deduplication.The remaining pool is dominated by near-duplicate in- tents (e.g., paraphrases of the same shopping or search query across thousands of products). We embed each task intent with Qwen/Qwen3-Embedding-8B and greed- ily remove tasks whose cosine similarity to a previously kept task exceeds 0.99 (-124,454, 88.9%→15,601). The final filte...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.