ToolRosella: Translating Code Repositories into Standardized Tools for Scientific Agents

Chaoqian Ouyang; Hanghui Guo; Jian Yin; Jia Zhu; Libin Zheng; Ling Yue; Min-Ling Zhang; Shaowu Pan; Shimin Di; Xujie Yuan

arxiv: 2603.09290 · v4 · pith:5C3YF2YJnew · submitted 2026-03-10 · 💻 cs.SE · cs.CE · cs.MA

Shimin Di, Xujie Yuan, Hanghui Guo, Chaoqian Ouyang, Yongxu Liu, Ling Yue, Zhangze Chen, Libin Zheng, Jia Zhu, Shaowu Pan, Jian Yin, Yong Rui, Min-Ling Zhang

Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3

classification 💻 cs.SE · cs.CE · cs.MA

keywords: ToolRosella, scientific agents, LLM tools, code repositories, tool standardization, repository conversion, agent frameworks, scientific computing

The pith

ToolRosella converts scientific code repositories into standardized, agent-invocable tools with 61.5 percent success after repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based agents for scientific work are limited by the small number of manually built tools available to them, while large amounts of useful code sit in open repositories that are hard to make reliable and callable. ToolRosella tackles this gap by running automated repository analysis, building tool interfaces, testing execution, and applying iterative repairs until the code becomes usable by agents. Tested on 122 GitHub repositories spanning 35 subdisciplines in six domains, the system reaches 61.5 percent conversion success after repairs, produces 1,580 callable tools, runs 4.4 times faster than human engineers, and supports 84 percent success on downstream tasks. These tools also raise performance when added to other agent frameworks, especially on problems whose needed functions are missing from existing fixed tool sets. The result matters because it offers a way to expand what agents can do in science without requiring constant human curation of every new capability.

Core claim

ToolRosella is a framework that automatically transforms heterogeneous scientific code repositories into standardized, agent-invocable tools through the combination of repository analysis, tool interface construction, execution testing, and iterative repair. Across 122 GitHub repositories covering 35 subdisciplines in six domains, it achieves a 61.5 percent repository conversion success rate after iterative repair at 4.4 times the speed of human engineers, yielding 1,580 callable tools that deliver an 84.0 percent success rate on downstream tasks and improve results when integrated into other agent frameworks, particularly where required tools are absent from curated inventories.
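The headline percentages can be turned back into repository counts with a little arithmetic. The sketch below is a reader's sanity check, not a figure from the paper: it finds which whole number of converted repositories is consistent with the reported 61.5 percent of 122, and then divides the 1,580 tools by that count under the assumption (not stated in the text above) that every tool comes from a successfully converted repository.

```python
# Reported figures taken from the core claim above.
repositories = 122
reported_rate = 61.5  # percent of repositories converted after repair
tools = 1580

# Nearest whole number of repositories implied by the reported rate.
converted = round(repositories * reported_rate / 100)

# Every integer count that rounds to the reported rate at one decimal place.
consistent = [
    k for k in range(repositories + 1)
    if round(100 * k / repositories, 1) == reported_rate
]

# Assumption: all tools originate from converted repositories.
tools_per_converted = tools / converted

print(converted, consistent, round(tools_per_converted, 1))
```

Only one integer count matches the reported rate, so the percentage pins down the number of converted repositories; the tools-per-repository average is only as good as the stated assumption.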

What carries the argument

The ToolRosella pipeline of repository analysis, interface construction, execution testing, and iterative repair that standardizes code into agent-callable tools.
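The control flow of such a pipeline can be sketched as a build, test, repair loop. This is a minimal illustration and not the authors' implementation: `convert_repository`, `ConversionResult`, `build_tool`, `repair`, and `max_repairs` are hypothetical names, and the "tool" is reduced to a plain Python callable so the loop itself stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Tool = Callable[[float], float]


@dataclass
class ConversionResult:
    """Outcome of one repository-to-tool conversion (hypothetical record)."""
    success: bool
    attempts: int
    errors: List[str] = field(default_factory=list)


def convert_repository(
    build_tool: Callable[[], Tool],
    repair: Callable[[Tool, str], Tool],
    sample_input: float,
    max_repairs: int = 3,
) -> ConversionResult:
    """Build a tool, execute it on a sample input, and repair it on failure."""
    tool = build_tool()  # stands in for repository analysis + interface construction
    errors: List[str] = []
    for attempt in range(1, max_repairs + 2):  # one initial try plus max_repairs retries
        try:
            tool(sample_input)  # execution testing: does the call complete?
            return ConversionResult(True, attempt, errors)
        except Exception as exc:
            errors.append(repr(exc))
            if attempt <= max_repairs:
                tool = repair(tool, repr(exc))  # iterative repair driven by the error
    return ConversionResult(False, max_repairs + 1, errors)


def build_broken() -> Tool:
    """Hypothetical first-draft wrapper that always fails."""
    def tool(x: float) -> float:
        raise ZeroDivisionError("division by zero")
    return tool


def repair_once(tool: Tool, error: str) -> Tool:
    """Hypothetical repair step that swaps in a working wrapper."""
    return lambda x: x * 2.0


def repair_nothing(tool: Tool, error: str) -> Tool:
    """Hypothetical repair step that fails to change anything."""
    return tool


fixed = convert_repository(build_broken, repair_once, 1.5)
stuck = convert_repository(build_broken, repair_nothing, 1.5, max_repairs=2)
print(fixed.success, fixed.attempts, stuck.success, stuck.attempts)
```

Note that success here means only that the call completed without raising, which is exactly the kind of check the referee report questions later in this review.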

If this is right

Agents can draw on far larger sets of scientific functionality without manual tool creation for each repository.
Task success rates rise on problems that need tools missing from fixed, hand-curated inventories.
Human engineering time for tool standardization drops by a factor of roughly 4.4.
The produced tools integrate directly into multiple existing agent frameworks and raise their performance.
The same conversion process applies across many scientific domains and subdisciplines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider use could reduce the bottleneck of tool curation and support more autonomous scientific workflows.
The method might extend to non-scientific code bases for general-purpose agents.
Additional domain checks could be layered on to catch subtle functionality losses the current tests miss.
Scaling to thousands more repositories would test whether the 61.5 percent rate holds outside the evaluated set.

Load-bearing premise

That automatic analysis, interface building, testing, and repair can turn varied scientific code into reliable tools without losing original functionality or introducing new errors across many domains.

What would settle it

Apply ToolRosella to a fresh collection of repositories not included in the original 122 and check whether the conversion success rate remains near 61.5 percent while the resulting tools preserve the same computational outputs as the source code.
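The output-preservation half of that test could look like the sketch below, which compares a generated tool against the original function on shared inputs within a numerical tolerance. All names (`outputs_match`, `original`, `faithful_tool`, `drifted_tool`) and the tolerances are illustrative assumptions; the paper, as summarized here, does not describe such a check.

```python
import math
from typing import Callable, Iterable, List, Tuple


def outputs_match(
    original: Callable[[float], float],
    tool: Callable[[float], float],
    inputs: Iterable[float],
    rel_tol: float = 1e-9,
    abs_tol: float = 1e-12,
) -> Tuple[bool, List[float]]:
    """Return (all_equal, inputs where the tool's output departs from the original)."""
    mismatches = [
        x for x in inputs
        if not math.isclose(original(x), tool(x), rel_tol=rel_tol, abs_tol=abs_tol)
    ]
    return (not mismatches, mismatches)


def original(x: float) -> float:
    """Stand-in for a function in the source repository."""
    return x ** 2


def faithful_tool(x: float) -> float:
    """Wrapper that preserves the original computation."""
    return x * x


def drifted_tool(x: float) -> float:
    """Wrapper that runs without crashing but silently shifts the result."""
    return x ** 2 + 1e-3


samples = [0.0, 1.5, 3.0]
ok, bad_inputs = outputs_match(original, faithful_tool, samples)
drift_ok, drift_inputs = outputs_match(original, drifted_tool, samples)
print(ok, bad_inputs, drift_ok, drift_inputs)
```

The drifted wrapper would pass a crash-only test on every sample yet fails the comparison on all of them, which is the gap a held-out replication would need to measure.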

Original abstract

Large Language Model (LLM)-based agent systems are increasingly used for scientific tasks, yet their practical capability remains constrained by the narrow scope of manually curated tools they can invoke. Much scientific computational functionality already exists in open-source code repositories, but these resources remain difficult to standardize, operationalize, and invoke reliably for agent use. Here we present ToolRosella, a framework that automatically transforms heterogeneous scientific code repositories into standardized, agent-invocable tools. ToolRosella combines repository analysis, tool interface construction, execution testing, and iterative repair to address the problem of repository-to-tool standardization. Across 122 GitHub repositories spanning 35 subdisciplines in six domains, ToolRosella reaches a 61.5% repository conversion success rate after iterative repair, with a 4.4x speedup over human engineers. The resulting 1,580 callable tools support a downstream task success rate of 84.0% and improve performance when integrated into other agent frameworks, particularly on tasks whose required tools are absent from fixed, curated inventories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ToolRosella gives a concrete pipeline that converts scientific repos into agent tools at scale with reported speedups, but its success metric likely only checks for crashes rather than preserved scientific behavior.

ToolRosella turns code repositories into standardized tools for LLM agents by analyzing repos, building interfaces, testing execution, and repairing issues iteratively. On 122 GitHub repos from 35 subdisciplines, it achieves 61.5% conversion success, produces 1,580 tools, and reports a 4.4x speedup over human engineers along with 84% success on downstream tasks when the tools are used in agents.

The paper does a solid job describing an end-to-end process that addresses a real bottleneck in agent systems for science. The scale of the evaluation stands out, and the numbers on conversion and task performance give a sense of what is achievable in practice. They also show integration benefits with other frameworks.

The main soft spot is in how success is measured. The framework relies on execution testing and repair, but for scientific code this often means checking that it runs without crashing on sample inputs rather than verifying that outputs match the original implementation exactly. Numerical differences, changed defaults, or lost side effects could slip through, which would undermine the downstream claims. The abstract gives no details on error analysis or how they confirmed semantic preservation, so the 84% task success rests on an assumption that needs more scrutiny.

This work is aimed at researchers building or extending scientific agent systems who want to expand their tool libraries beyond manual curation. Readers focused on practical agent engineering will find the framework and the reported metrics useful as a starting point. It deserves serious peer review because the core idea is grounded in a clear problem and comes with empirical results at a decent scale. The evaluation methodology will need strengthening, but the paper is coherent enough to warrant referee input.

Referee Report

2 major / 2 minor

Summary. The paper introduces ToolRosella, a framework that automatically converts heterogeneous scientific code repositories into standardized, agent-invocable tools via repository analysis, interface construction, execution testing, and iterative repair. Evaluated on 122 GitHub repositories spanning 35 subdisciplines in six domains, it reports a 61.5% conversion success rate after repair, a 4.4x speedup over human engineers, 1,580 resulting callable tools, an 84.0% downstream task success rate, and performance gains when integrated into other agent frameworks.

Significance. If the conversion process reliably preserves original functionality, the work would meaningfully expand the tool inventory available to LLM-based scientific agents beyond manually curated sets, with the reported empirical scale (122 repositories, 1,580 tools) and downstream improvements providing concrete evidence of practical impact. The concrete success rates and speedup measurements are strengths that support the central claim of scalable standardization.

major comments (2)

[Evaluation and Results] The definition of repository conversion success (61.5% after iterative repair) relies on execution testing that checks for non-crashing behavior on example inputs, but the manuscript provides no details on verification of semantic equivalence for scientific outputs such as numerical accuracy, solver results, or side effects. This assumption is load-bearing for the downstream 84.0% task success claim and the assertion that the 1,580 tools preserve original repository functionality without introducing silent errors.
[Results] The 4.4x speedup over human engineers is presented as a direct empirical outcome, yet the manuscript lacks a precise description of the human baseline protocol, task scope, and measurement methodology (e.g., time per repository, expertise level), making it difficult to assess whether the comparison fairly supports the efficiency claim.

minor comments (2)

The abstract and results would benefit from explicit reference to any supplementary material containing the full error analysis or repair logs to allow readers to evaluate the iterative repair process.
Notation for tool interface standardization (e.g., how function signatures and parameter mappings are formalized) could be clarified with a small example in the methods section for reproducibility.
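To make the second minor comment concrete, the kind of small example being asked for might resemble the sketch below: a declarative description of one tool's parameters plus a dispatcher that validates a call against it. The spec layout, the tool name `integrate_trapezoid`, and the helper `call_tool` are all hypothetical and are not taken from the paper.

```python
from typing import Any, Callable, Dict, List

# Hypothetical standardized description of a single tool.
TOOL_SPEC: Dict[str, Any] = {
    "name": "integrate_trapezoid",
    "description": "Integrate equally spaced samples with the trapezoidal rule.",
    "parameters": {
        "values": {"type": list, "required": True},
        "dx": {"type": float, "required": False, "default": 1.0},
    },
}


def integrate_trapezoid(values: List[float], dx: float = 1.0) -> float:
    """Stand-in for a function exposed by a source repository."""
    return dx * sum((a + b) / 2.0 for a, b in zip(values, values[1:]))


def call_tool(spec: Dict[str, Any], func: Callable[..., Any], arguments: Dict[str, Any]) -> Any:
    """Validate arguments against the spec, fill in defaults, then call the function."""
    params = spec["parameters"]
    unknown = set(arguments) - set(params)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    bound: Dict[str, Any] = {}
    for name, rule in params.items():
        if name in arguments:
            if not isinstance(arguments[name], rule["type"]):
                raise TypeError(f"{name} must be {rule['type'].__name__}")
            bound[name] = arguments[name]
        elif rule["required"]:
            raise ValueError(f"missing required argument: {name}")
        else:
            bound[name] = rule["default"]  # mapping from spec default to call argument
    return func(**bound)


area_default = call_tool(TOOL_SPEC, integrate_trapezoid, {"values": [0.0, 1.0, 2.0]})
area_half = call_tool(TOOL_SPEC, integrate_trapezoid, {"values": [0.0, 1.0, 2.0], "dx": 0.5})
print(area_default, area_half)
```

Spelling out defaults in the spec matters for the first major comment as well, since a changed default is one of the ways a wrapper can run cleanly while computing something different from the source code.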

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we provide point-by-point responses to the major comments and indicate the revisions made.

Referee: [Evaluation and Results] The definition of repository conversion success (61.5% after iterative repair) relies on execution testing that checks for non-crashing behavior on example inputs, but the manuscript provides no details on verification of semantic equivalence for scientific outputs such as numerical accuracy, solver results, or side effects. This assumption is load-bearing for the downstream 84.0% task success claim and the assertion that the 1,580 tools preserve original repository functionality without introducing silent errors.

Authors: We agree that the conversion success metric is defined via execution testing for non-crashing behavior on example inputs and does not include explicit verification of semantic equivalence such as numerical accuracy, solver outputs, or side effects. The manuscript therefore does not claim or demonstrate full semantic preservation beyond operational executability. The reported 84.0% downstream task success rate provides indirect support that the tools function usefully in agent workflows, but we acknowledge this does not substitute for direct equivalence checks. In the revision we have added an explicit limitations paragraph clarifying the evaluation scope and noting the difficulty of general semantic equivalence testing across heterogeneous scientific codes.
Revision: partial

Referee: [Results] The 4.4x speedup over human engineers is presented as a direct empirical outcome, yet the manuscript lacks a precise description of the human baseline protocol, task scope, and measurement methodology (e.g., time per repository, expertise level), making it difficult to assess whether the comparison fairly supports the efficiency claim.

Authors: We accept that the original manuscript did not supply sufficient detail on the human baseline. We have revised the relevant results section to describe the protocol: two graduate students with domain expertise each processed a disjoint subset of the 122 repositories; timing began at repository checkout and ended when a working standardized tool interface was produced; the task scope matched the automated pipeline exactly, including interface design and basic testing. This added description allows readers to evaluate the fairness of the 4.4x comparison.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from external repositories

The paper presents ToolRosella as an empirical system for repository-to-tool conversion, with all reported metrics (61.5% success rate, 4.4x speedup, 84.0% downstream task success) obtained directly from execution testing on 122 external GitHub repositories across domains. No equations, parameter fitting, self-citations, or uniqueness theorems appear in the provided text to derive these outcomes; the evaluation chain relies on independent test inputs and observed behavior rather than reducing to fitted inputs or self-referential definitions. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the feasibility of automatic standardization and repair for scientific code; no free parameters or invented entities are evident from the abstract.

axioms (2)

Domain assumption: Heterogeneous scientific code repositories can be automatically analyzed and transformed into standardized agent-invocable tools without substantial loss of original functionality.
This assumption underpins the entire conversion pipeline and reported success rate.

Domain assumption: Iterative repair processes can resolve execution and interface issues across diverse codebases in a reliable manner.
Required to reach the 61.5% success rate after initial conversion attempts.

pith-pipeline@v0.9.0 · 5525 in / 1360 out tokens · 47362 ms · 2026-05-15T13:52:15.104433+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs
cs.CV · 2026-04 · unverdicted · novelty 7.0
RemoteAgent uses RL fine-tuning on VagueEO to align MLLMs for vague EO intent recognition, handling simple tasks internally and routing dense predictions to tools via Model Context Protocol.

FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
cs.AI · 2026-04 · conditional · novelty 7.0
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
cs.AI · 2026-04 · unverdicted · novelty 7.0
SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.