Toward Autonomous Long-Horizon Engineering for ML Research
Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3
The pith
AiScientist achieves higher performance on long-horizon ML research benchmarks by using hierarchical orchestration and a File-as-Bus workspace.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present AiScientist as a system for long-horizon ML research engineering that integrates hierarchical orchestration with a permission-scoped File-as-Bus workspace. The orchestrator exerts thin control by issuing concise summaries and maintaining a workspace map, while specialized agents re-ground their work on durable artifacts including analyses, plans, code, and experimental evidence. This architecture produces coherent multi-stage progress and delivers measurable gains: an average 10.54-point improvement on PaperBench over the strongest baseline and 81.82 Any Medal% on MLE-Bench Lite. Ablation experiments identify the File-as-Bus protocol as a primary contributor to these outcomes.
What carries the argument
The File-as-Bus workspace under hierarchical orchestration: agents exchange and persist project state through files rather than conversation, with an orchestrator providing high-level direction via summaries and maps.
Load-bearing premise
The benchmarks used reflect real-world long-horizon ML research demands and the performance differences arise chiefly from the proposed orchestration and File-as-Bus components.
What would settle it
An experiment showing that a baseline agent with only conversational memory achieves similar scores on PaperBench and MLE-Bench Lite, or a new benchmark where the AiScientist design fails to maintain progress over longer periods.
Figures
read the original abstract
Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as \emph{long-horizon ML research engineering}: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal\% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal\% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AiScientist, a system for autonomous long-horizon engineering in ML research. It combines hierarchical orchestration, where a top-level Orchestrator uses concise summaries and a workspace map for stage-level control, with specialized agents that rely on a durable File-as-Bus workspace for state continuity instead of conversational handoffs. Evaluations on PaperBench and MLE-Bench Lite show an average 10.54 point improvement on PaperBench over the best matched baseline and 81.82 Any Medal% on MLE-Bench Lite. Ablations indicate that removing the File-as-Bus protocol reduces scores by 6.41 on PaperBench and 31.82 on MLE-Bench Lite.
Significance. Should the results prove robust under controlled conditions, the work is significant in demonstrating that long-horizon ML research tasks benefit from systems-level designs emphasizing structured coordination and persistent state management. The explicit use of benchmarks with reported ablations strengthens the case for this approach over purely reasoning-focused methods.
major comments (1)
- The manuscript states that baselines are 'best matched' and reports ablation results for File-as-Bus removal, but does not detail whether the ablation maintains identical agent sets, model choices, total token budgets, and interaction limits as the full AiScientist system. This information is necessary to attribute the performance differences specifically to the hierarchical orchestration and File-as-Bus design rather than other implementation factors.
minor comments (1)
- The abstract could specify the number of experimental runs or include variance measures for the reported average improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and will revise the manuscript to provide the requested experimental controls.
read point-by-point responses
-
Referee: The manuscript states that baselines are 'best matched' and reports ablation results for File-as-Bus removal, but does not detail whether the ablation maintains identical agent sets, model choices, total token budgets, and interaction limits as the full AiScientist system. This information is necessary to attribute the performance differences specifically to the hierarchical orchestration and File-as-Bus design rather than other implementation factors.
Authors: We agree that the manuscript should explicitly document these controls to allow readers to attribute the ablation results to the File-as-Bus protocol. In the revised version we will add a dedicated paragraph in the Experiments section (and update the ablation table caption) stating that the File-as-Bus ablation uses identical agent sets, the same model choices and backends, the same total token budgets, and the same interaction limits as the full AiScientist system. This clarification will be added without altering any reported numbers. revision: yes
Circularity Check
No circularity: empirical benchmark results with no derivations or self-referential loops
full rationale
The paper describes an implemented system (AiScientist) and reports measured performance on external benchmarks (PaperBench, MLE-Bench Lite) plus ablation deltas. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims reduce to observed scores rather than any quantity defined in terms of itself or smuggled via prior author work. Attribution concerns (baseline matching, component isolation) are experimental-validity issues, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Specialized agents can effectively re-ground on durable file artifacts such as analyses, plans, code, and experimental evidence
invented entities (1)
-
File-as-Bus workspace
no independent evidence
Forward citations
Cited by 3 Pith papers
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
GEAR: Genetic AutoResearch for Agentic Code Evolution
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
-
AI for Auto-Research: Roadmap & User Guide
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.