ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Ben Kao; Biqing Qi; Chang Ma; Fangzhi Xu; Haiteng Zhao; Jianing Wang; Kanzhi Cheng; Lingpeng Kong; Qintong Li; Qiushi Sun

arXiv: 2505.19897 · v3 · submitted 2025-05-26 · 💻 cs.AI · cs.CL · cs.CV · cs.HC

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu

Pith reviewed 2026-05-19 14:11 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.HC

keywords multimodal agents · scientific workflows · benchmark · autonomous agents · LLM agents · scientific discovery · evaluation

The pith

Current multimodal agents reach only a 15 percent success rate on realistic scientific workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ScienceBoard as an environment and benchmark for testing how well multimodal agents can carry out real scientific work using professional software. It includes 169 human-curated tasks drawn from biochemistry, astronomy, and geoinformatics that require agents to interact with dynamic interfaces and complete multi-step processes. Evaluations of leading models show they complete only about 15 percent of these tasks overall. A sympathetic reader would care because reliable automation of such routines could free scientists from repetitive work and speed up discovery. The benchmark also supplies concrete data on where current agents fail and what design changes might help.

Core claim

ScienceBoard consists of a realistic multi-domain environment with dynamic and visually rich scientific workflows that integrate professional software, together with a benchmark of 169 high-quality tasks validated by humans. When state-of-the-art agents such as GPT-4o, Claude 3.7, and UI-TARS are tested in this setting they achieve an overall success rate of only 15 percent and therefore fall short of reliably assisting scientists in complex workflows, while the accompanying analysis yields insights into current limitations and principles for more capable agents.

What carries the argument

ScienceBoard environment and benchmark, which lets agents interact autonomously through varied interfaces with integrated professional software in realistic scientific workflows.

If this is right

Agents must improve at long-horizon planning and interface manipulation to become useful in scientific settings.
Domain-specific failure patterns identified in the evaluation point to concrete targets for better multimodal training.
The benchmark supplies a repeatable standard that future agents can be measured against as capabilities advance.
Higher success on these workflows would allow agents to automate routine steps in biochemistry, astronomy, and geoinformatics research.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If agents reach reliable performance here they could be tested next on longer, multi-day experimental campaigns that cross several software tools.
The emphasis on visual and dynamic interfaces suggests that progress may also benefit non-scientific computer-use tasks that share similar requirements.
Expanding the task set to include domains such as materials science or clinical data analysis would test whether the observed limitations are general.

Load-bearing premise

The 169 tasks, curated and validated by humans, accurately represent the complexity and requirements of real scientific discovery workflows across the chosen domains.

What would settle it

A new study in which practicing scientists rate the tasks as unrepresentative of their actual daily work or in which agents reach above 50 percent success on the same 169 tasks would undermine the central evaluation result.
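Because the benchmark has a fixed size of 169 tasks, any claim that agents are above or below a threshold such as 50 percent carries sampling uncertainty. The sketch below is one hedged way to check this with a Wilson score interval. The success count of 25 is an assumption (roughly 15 percent of 169), not a figure reported in the paper, and the function name `wilson_interval` is illustrative.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

TASKS = 169             # benchmark size stated in the paper
assumed_successes = 25  # hypothetical count, roughly 15% of 169

low, high = wilson_interval(assumed_successes, TASKS)
# The 50 percent threshold is only cleared if the whole interval sits above it.
clears_half = low > 0.5
print(f"interval: [{low:.3f}, {high:.3f}], clears 50%: {clears_half}")
```

Under these assumptions the interval stays well below one half, so the settling condition would require a much larger success count than the one assumed here.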

Original abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ScienceBoard adds a useful new multi-domain benchmark for scientific agents with professional tools, showing a 15% success rate that highlights real gaps, though task realism validation is light on details.

The main takeaway is that this paper builds ScienceBoard, a new environment with integrated scientific software across biochemistry, astronomy, and geoinformatics, plus 169 human-curated tasks. Evaluations of current agents like GPT-4o and Claude 3.7 land at roughly 15% overall success, which lines up with the idea that these systems are not yet reliable for complex research workflows. Releasing the code and benchmark publicly is a practical move that lets others test against it directly. The multi-domain scope and focus on visually rich, dynamic interfaces go beyond many existing general agent tests and give a clearer picture of where tool-use agents fall short in professional settings. The analysis of limitations also points to some concrete design ideas for future work.

The soft spot is the representativeness claim. The abstract calls the tasks rigorously validated by humans, but it does not report inter-rater reliability numbers, expert realism ratings, or direct comparisons to actual researcher logs or protocols. If the selected tasks are more scripted or lower in ambiguity than typical discovery work, the low success rate could partly reflect benchmark construction rather than pure agent limits. That said, the central empirical result still stands as a measurement on this specific set of tasks.

This paper is aimed at groups building or evaluating agents for scientific assistance and at benchmark researchers who want testbeds that involve real software. It deserves a serious referee because the environment and task set are new, the evaluations are straightforward, and the public artifacts make follow-up easy. Referees could usefully press on the validation metrics and error analysis without undermining the core contribution. I would send it to peer review.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces ScienceBoard, comprising a realistic multi-domain environment for dynamic, visually rich scientific workflows integrated with professional software, and a benchmark of 169 human-curated, rigorously validated tasks spanning biochemistry, astronomy, and geoinformatics. Evaluations of multimodal agents using backbones such as GPT-4o, Claude 3.7, and UI-TARS report an overall success rate of 15%, leading to the conclusion that current agents fall short of reliably assisting scientists in complex workflows; the work includes in-depth analysis for addressing limitations and design principles, with code, environment, and benchmark released publicly.

Significance. If the tasks accurately capture real scientific workflows, this provides a valuable empirical benchmark highlighting the gap between state-of-the-art agents and practical scientific assistance, along with actionable insights for future agent design. The public release of the full environment and benchmark is a notable strength that supports reproducibility and community progress in this area.

major comments (1)

[Abstract] Abstract and benchmark description: the central claim that agents 'fall short of reliably assisting scientists in complex workflows' rests on the 169 tasks being representative of real discovery processes, yet the manuscript provides no quantitative validation metrics (e.g., inter-rater reliability, expert realism scores, or comparison to actual researcher logs/protocols). This weakens the strength of the 15% success rate as evidence of fundamental agent limitations versus benchmark design choices.

minor comments (1)

[Abstract] The abstract would benefit from including per-domain or per-backbone success rates alongside the overall 15% figure to give readers immediate context for the headline result.
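The per-domain and per-backbone breakdown the referee asks for is a simple grouped aggregation over per-task outcomes. The sketch below shows one way to compute it. The record layout, the backbone names `model_a` and `model_b`, and every outcome value are placeholders for illustration; none of them are results from the paper.

```python
from collections import defaultdict

# Hypothetical per-task outcomes: (domain, backbone, succeeded).
# Placeholder values only; these are NOT results from the paper.
records = [
    ("biochemistry", "model_a", True),
    ("biochemistry", "model_a", False),
    ("astronomy", "model_a", False),
    ("astronomy", "model_b", True),
    ("geoinformatics", "model_b", False),
    ("geoinformatics", "model_b", False),
]

def success_rates(rows, key_index):
    """Return {group: fraction of successful tasks}, grouped by one column."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        group = row[key_index]
        totals[group] += 1
        wins[group] += int(row[2])
    return {group: wins[group] / totals[group] for group in totals}

by_domain = success_rates(records, 0)    # grouped by domain
by_backbone = success_rates(records, 1)  # grouped by backbone
overall = sum(int(r[2]) for r in records) / len(records)

for name, rate in sorted(by_domain.items()):
    print(f"{name}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

Reporting the grouped rates next to the overall figure makes it visible when a single domain or backbone dominates the headline number.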

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation of minor revision. We address the major comment regarding quantitative validation of the benchmark tasks below.

Referee: [Abstract] Abstract and benchmark description: the central claim that agents 'fall short of reliably assisting scientists in complex workflows' rests on the 169 tasks being representative of real discovery processes, yet the manuscript provides no quantitative validation metrics (e.g., inter-rater reliability, expert realism scores, or comparison to actual researcher logs/protocols). This weakens the strength of the 15% success rate as evidence of fundamental agent limitations versus benchmark design choices.

Authors: We appreciate the referee's point on strengthening the evidence for task representativeness. The 169 tasks were curated by domain experts with direct research experience in biochemistry, astronomy, and geoinformatics, with each task explicitly designed to replicate authentic workflows involving professional software interfaces (e.g., data processing pipelines, image analysis tools, and spatial modeling environments). The curation process included multiple rounds of expert review for feasibility and realism. We acknowledge that the manuscript does not currently report quantitative metrics such as inter-rater reliability or aggregated expert realism scores. In the revised manuscript, we will add a dedicated subsection on benchmark construction that includes these details (specifically, average expert ratings on workflow realism on a 1-5 scale and measures of agreement among curators) to more rigorously support that the 15% success rate reflects agent limitations rather than benchmark artifacts.

Revision: yes
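The rebuttal does not say which agreement measure would be reported. As a hedged illustration, the sketch below computes a mean realism rating and unweighted Cohen's kappa for two curators rating the same tasks on a 1-5 scale. The choice of Cohen's kappa, the function name `cohens_kappa`, and the ratings themselves are assumptions made for this example, not details from the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(ratings_a)
    # Observed agreement: share of items with identical ratings.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 realism ratings for six tasks (placeholder data).
curator_1 = [5, 4, 4, 3, 5, 2]
curator_2 = [5, 4, 3, 3, 5, 2]

kappa = cohens_kappa(curator_1, curator_2)
mean_realism = sum(curator_1 + curator_2) / (2 * len(curator_1))
print(f"kappa = {kappa:.3f}, mean realism = {mean_realism:.2f}")
```

Unweighted kappa treats the five rating levels as unordered categories. For an ordinal scale, a weighted kappa or Krippendorff's alpha may be a better fit, particularly with more than two curators.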

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no load-bearing circularity

The paper's core contribution is the creation of the ScienceBoard environment and a benchmark of 169 human-curated tasks, followed by direct empirical measurement of agent success rates (e.g., 15% overall). No mathematical derivations, equations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The evaluation chain is a straightforward measurement against an externally defined set of tasks rather than a self-referential reduction. Minor self-citation risk exists in any benchmark paper but is not load-bearing here, as the result is falsifiable via replication on the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on standard assumptions about agent interaction capabilities and the representativeness of curated tasks; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Human-curated tasks in the benchmark faithfully capture the demands of real scientific workflows.
This premise underpins the claim that the 15% success rate reflects current agent limitations in assisting scientists.

pith-pipeline@v0.9.0 · 5874 in / 1148 out tokens · 49382 ms · 2026-05-19T14:11:30.739593+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We employ UCSF ChimeraX … Lean 4 … Celestia … GrassGIS … TeXstudio"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "average success rate of agents ranges between 0% to 15%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
cs.CL · 2026-05 · unverdicted · novelty 7.0
MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.

Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation
cs.HC · 2026-04 · unverdicted · novelty 7.0
GUI agents can transform live web interfaces in real-time via DOM manipulations to deliver contextual assistance directly within the application.

Gym-Anything: Turn any Software into an Agent Environment
cs.LG · 2026-04 · unverdicted · novelty 6.0
Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improv...

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV · 2025-08 · unverdicted · novelty 6.0
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

Heterogeneous Scientific Foundation Model Collaboration
cs.AI · 2026-04 · unverdicted · novelty 5.0
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
cs.AI · 2026-04 · unverdicted · novelty 4.0
CMBAgent achieves high accuracy on well-specified astrophysical tasks with context but generates silent, plausible-yet-incorrect outputs on reasoning-challenging problems, with no self-diagnosis of inconsistencies.