ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

Qiushi Sun , Zhoumianze Liu , Chang Ma , Zichen Ding , Fangzhi Xu , Zhangyue Yin , Haiteng Zhao , Zhenyu Wu , Kanzhi Cheng , Zhaoyang Liu , Jianing Wang , Qintong Li , Xiangru Tang , Tianbao Xie , Xiachong Feng , Xiang Li , Ben Kao , Wenhai Wang , Biqing Qi , Lingpeng Kong , Zhiyong Wu

Authors on Pith no claims yet

classification 💻 cs.AI cs.CLcs.CVcs.HC

keywords agentsscientificworkflowsaddressingbenchmarkcapablecomplexdiscovery

0 comments

read the original abstract

Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
cs.CL 2026-05 unverdicted novelty 7.0

MANTRA automatically synthesizes SMT-validated compliance benchmarks for LLM agents from natural language manuals and tool schemas, producing 285 tasks across 6 domains with minimal human effort.
Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation
cs.HC 2026-04 unverdicted novelty 7.0

GUI agents can transform live web interfaces in real-time via DOM manipulations to deliver contextual assistance directly within the application.
Gym-Anything: Turn any Software into an Agent Environment
cs.LG 2026-04 unverdicted novelty 6.0

Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improv...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows
cs.AI 2026-04 unverdicted novelty 4.0

CMBAgent achieves high accuracy on well-specified astrophysical tasks with context but generates silent, plausible-yet-incorrect outputs on reasoning-challenging problems, with no self-diagnosis of inconsistencies.