SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Baobao Chang; Elvis Zhang; Jason Zeng; Jialong Wu; Kean Shi; Kuan Li; Liang Chen; Michael Heinrich; Ming Wu; Qingyao Yang

arxiv: 2605.15777 · v2 · pith:YMQQ42OKnew · submitted 2026-05-15 · 💻 cs.AI

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords SaaS-Bench · computer-using agents · LLM agents · professional workflows · benchmark · task completion · GUI agents

The pith

LLM-based computer-using agents complete fewer than 4% of realistic professional SaaS tasks end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SaaS-Bench as a benchmark for evaluating computer-using agents in real software-as-a-service environments. It includes 106 tasks across 23 SaaS systems in six professional domains, requiring long-horizon interactions and coordination. Experiments with representative agents show success rates below 4% for the strongest models, pointing to weaknesses in planning, tracking states across applications, and recovering from errors. This evaluation matters because it tests agents on the kind of dynamic, multi-step work that professionals do daily in tools like project management and collaboration software. A sympathetic reader would conclude that current agent designs are not yet ready for complex real-world deployment.

Core claim

SaaS-Bench is introduced as a benchmark built on 23 deployable SaaS systems across six domains with 106 tasks grounded in realistic scenarios. These tasks involve long-horizon execution in both text and multimodal settings and use weighted verification checkpoints to measure completion and progress. Representative LLM-based agents struggle, with the strongest completing fewer than 4% of tasks end-to-end, revealing limitations in planning, state tracking, cross-application context maintenance, and error recovery.
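The two measurements named above, strict completion and partial progress, can be illustrated with a minimal sketch. This page does not give the paper's actual scoring formula, so everything here is an assumption: the `Checkpoint` class, the example checkpoint names and weights, and the choice to define the checkpoint score as the weighted fraction of passed checkpoints and "resolved" as all checkpoints passing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str      # hypothetical label for a verified state change
    weight: float  # hypothetical relative importance of the checkpoint
    passed: bool   # outcome of the verification check


def checkpoint_score(checkpoints):
    """Partial progress: weighted fraction of passed checkpoints (assumed definition)."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("total checkpoint weight must be positive")
    earned = sum(c.weight for c in checkpoints if c.passed)
    return earned / total


def resolved(checkpoints):
    """Strict completion: every checkpoint passed (assumed definition)."""
    return bool(checkpoints) and all(c.passed for c in checkpoints)


# Illustrative task with made-up checkpoints, not taken from the benchmark.
task = [
    Checkpoint("invoice_created", 2.0, True),
    Checkpoint("crm_record_updated", 1.0, True),
    Checkpoint("notification_sent", 1.0, False),
]

print(checkpoint_score(task))  # 0.75
print(resolved(task))          # False
```

Under these assumptions a task can earn a high checkpoint score while still counting as unresolved, which is consistent with the gap the page describes between partial progress and end-to-end completion.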

What carries the argument

The SaaS-Bench benchmark itself: 106 tasks scored by weighted verification checkpoints, which test agents against dynamic system states and cross-application coordination in professional SaaS environments.

If this is right

Current agents lack the ability to maintain context across multiple applications over long periods.
Error recovery is a critical missing capability for handling real workflows.
Both planning and state tracking need significant improvement to achieve practical utility.
The benchmark highlights the need for agents that can handle multimodal inputs effectively in GUI settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If true, this implies that future agent research should prioritize architectures with explicit memory or state management modules.
The results could motivate development of hybrid systems that combine LLM reasoning with rule-based automation for SaaS tasks.
Extending the benchmark to include more domains might reveal domain-specific strengths or weaknesses in agent performance.

Load-bearing premise

The assumption that the selected 106 tasks accurately represent realistic professional workflows and that the weighted checkpoints reliably indicate task success or partial progress.

What would settle it

A new agent design that achieves end-to-end completion on more than 20% of the 106 tasks would challenge the reported limitations of current approaches.
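To make the two percentages concrete, the short calculation below converts them into task counts. It assumes the rates are computed over a single pass through all 106 tasks, which this page does not confirm; the helper names are hypothetical.

```python
TOTAL_TASKS = 106


def max_tasks_below(percent, total):
    """Largest task count n with n / total strictly below percent%."""
    n = (percent * total) // 100
    if n * 100 == percent * total:
        n -= 1  # exactly at the threshold does not count as "below"
    return n


def min_tasks_above(percent, total):
    """Smallest task count n with n / total strictly above percent%."""
    return (percent * total) // 100 + 1


print(max_tasks_below(4, TOTAL_TASKS))   # 4
print(min_tasks_above(20, TOTAL_TASKS))  # 22
```

Under that assumption, "fewer than 4%" corresponds to at most 4 fully resolved tasks, and the proposed 20% bar corresponds to at least 22.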

Figures

Figures reproduced from arXiv: 2605.15777 by Baobao Chang, Elvis Zhang, Jason Zeng, Jialong Wu, Kean Shi, Kuan Li, Liang Chen, Michael Heinrich, Ming Wu, Qingyao Yang, Ruoyu Wu, Tianyi Ma, Weichu Xie, Xinbo Xu, Zengji Tu, Zihang Li.

**Figure 1.** Leaderboard of SaaS-Bench. We report overall checkpoint scores (bar length) and resolved scores for seven frontier models across 106 long-horizon SaaS tasks.

Correspondence: Liang Chen <liangchen@unipat.ai>, Kuan Li <kuanli@unipat.ai>, Baobao Chang <chbb@pku.edu.cn>

**Figure 2.** SaaS-Bench provides a realistic benchmark for evaluating CUAs in deployable SaaS environments. It consists of 23 real SaaS systems organized into six professional domains, supporting 106 tasks that reflect real-world SaaS workflows.

**Figure 3.** Overview of SaaS-Bench. Agents receive natural-language task instructions and interact with locally deployed SaaS applications through browser-use. After execution, task outcomes are evaluated using verification tools, which are aggregated into resolved score and checkpoint score.

**Figure 4.** Task statistics of SaaS-Bench. (a) Nested donut showing the breakdown of SaaS-Bench tasks across the two evaluation modes (text-only and multimodal), six task domains, and the underlying SaaS applications. The outer ring quantifies how often each application is exercised, illustrating the diversity of real-world tools spanned by the benchmark. (b) Combined view of (top) the per-task application count and (…

**Figure 5.** Task synthesis pipeline of SaaS-Bench. Starting from domain-specific task seeds and occupational roles, SaaS-Bench synthesizes candidate tasks through an iterative Builder–Challenger–Refiner loop for template generation and instantiation. The generated tasks are then filtered by static rubric-based checking and execution check, ensuring that the final tasks are realistic, executable, and verifiable.

**Figure 6.** Pass@k average best scores (k = 1, 2, 3) for four models on SaaS-Bench across three evaluation splits: text-only, multimodal, and overall. Each bar is divided into three segments: the dark base represents pass@1, the mid-tone segment shows the incremental gain from pass@1 to pass@2, and the lightest segment shows the further gain to pass@3.

**Figure 7.** Left: Distribution of low-level actions emitted by Claude Opus 4.6 over the full benchmark; Right: categorization of failed verification checks by failure mode. Together the two panels link execution behaviour to the dominant failure types.

**Figure 8.** Per-task score as a function of three structural complexity measures: (… Panel text recovered from the extraction: (a) Score vs. # apps, with x-axis "# distinct apps per task" and y-axis "Avg. score (%)"; a further panel with x-axis "Operation length (steps, Opus)".

**Figure 9.** Per-domain composition of agent behaviour errors observed in the trajectories of Opus 4.6.

**Figure 10.** Average pass rate of verification check
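The "pass@k average best score" in the Figure 6 caption can be read as follows: for each task, take the best score among the first k attempts, then average over tasks. That reading is an assumption based on the caption wording alone; the function name and the example scores below are hypothetical.

```python
def pass_at_k_best(scores_per_task, k):
    """Mean over tasks of the best score among the first k attempts (assumed reading)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not scores_per_task:
        raise ValueError("need at least one task")
    best_scores = []
    for task_id, attempts in scores_per_task.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        best_scores.append(max(attempts[:k]))
    return sum(best_scores) / len(best_scores)


# Made-up checkpoint scores for two tasks, three attempts each.
runs = {
    "task_a": [0.25, 0.5, 0.5],
    "task_b": [0.0, 0.0, 0.75],
}

for k in (1, 2, 3):
    print(k, pass_at_k_best(runs, k))
# 1 0.125
# 2 0.25
# 3 0.625
```

Because the best of the first k attempts can never decrease as k grows, this quantity is non-decreasing in k, which matches the stacked-segment layout the caption describes.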

Original abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code are available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SaaS-Bench gives a more realistic testbed for agents on live professional tools, but the sub-4% success numbers rest on verification checkpoints whose construction is not yet clear enough to fully trust.

The main point is that this paper shows representative agents doing very poorly on a new set of tasks built from real SaaS tools, with the best ones finishing under 4 percent end to end. It suggests that planning, state tracking, and error recovery remain big hurdles when agents have to work across applications over long sequences.

What the paper gets right is the construction of the benchmark itself. They assembled 23 deployable SaaS systems across six professional domains and created 106 tasks grounded in realistic scenarios. These tasks involve long-horizon work, cross-application coordination, and both text and multimodal elements. That is more grounded than the isolated or simplified settings common in prior agent benchmarks. Making the code available helps others check the results or build on them.

The softer part is the evaluation method. The results rely on weighted verification checkpoints to judge full completion versus partial progress. The abstract does not provide much on how those checkpoints were selected, any agreement checks between raters, or tests of how the scores change if the weights shift. If the checkpoints do not fully capture critical state changes or if they under-penalize losses in context across apps, the low success rates could partly come from the scoring rules rather than the agents alone. The paper does flag the agent limitations, but the strength of that claim tracks how well the checkpoints reflect actual task success.

This work is for people studying computer-using agents who need evaluation settings closer to professional use. Readers who care about moving agent research past toy environments will find the setup and the reported gaps useful. It has enough substance to go to a serious referee, particularly since new benchmarks can shape what the field measures next. I would recommend putting it through peer review, with attention to the verification details in the revisions.

Referee Report

1 major / 1 minor

Summary. The paper introduces SaaS-Bench, a benchmark built on 23 deployable real-world SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks emphasize long-horizon execution, dynamic states, cross-application coordination, and both text-only and multimodal interactions. Evaluation uses weighted verification checkpoints to measure strict end-to-end task completion as well as partial progress. Experiments with representative LLM-based computer-use agents report that even the strongest model completes fewer than 4% of tasks end-to-end, highlighting limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is released for reproduction.

Significance. If the tasks accurately reflect professional workflows and the verification method reliably distinguishes full completion from partial progress, the benchmark would fill a notable gap left by existing simplified web and GUI agent evaluations. The reported sub-4% success rates would then constitute a concrete, falsifiable signal of current agent shortcomings in realistic SaaS settings. The public code release is a clear strength that supports reproducibility and future extensions.

major comments (1)

[§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.

minor comments (1)

The abstract contains a minor grammatical issue ('Code are available' should read 'Code is available').
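The sensitivity analysis requested in the major comment could be prototyped along the following lines. This is a sketch under stated assumptions, not the paper's protocol: the checkpoint score is taken to be the weighted fraction of passed checkpoints, strict resolution is taken to mean every checkpoint passed, and the function names, weights, and perturbation scheme are all hypothetical.

```python
import random


def checkpoint_score(weights, passed):
    """Weighted fraction of passed checkpoints (assumed definition)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for w, ok in zip(weights, passed) if ok) / total


def is_resolved(passed):
    """Strict completion: every checkpoint passed (assumed definition)."""
    return bool(passed) and all(passed)


def perturbed_score_range(weights, passed, scale=0.5, trials=1000, seed=0):
    """Min and max score when each weight is rescaled by a random factor
    in [1 - scale, 1 + scale]; scale < 1 keeps positive weights positive."""
    if not 0 <= scale < 1:
        raise ValueError("scale must be in [0, 1)")
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        noisy = [w * rng.uniform(1 - scale, 1 + scale) for w in weights]
        scores.append(checkpoint_score(noisy, passed))
    return min(scores), max(scores)


def drop_checkpoint(weights, passed, index):
    """Omit one checkpoint, mimicking a missing state transition."""
    keep = [i for i in range(len(weights)) if i != index]
    return [weights[i] for i in keep], [passed[i] for i in keep]


# Made-up task: three checkpoints, the last one failed.
weights = [2.0, 1.0, 1.0]
passed = [True, True, False]

low, high = perturbed_score_range(weights, passed)
print(f"baseline {checkpoint_score(weights, passed):.2f}, range {low:.2f} to {high:.2f}")

_, passed_without_last = drop_checkpoint(weights, passed, 2)
print(is_resolved(passed), is_resolved(passed_without_last))  # False True
```

One consequence of these assumed definitions is worth noting: reweighting with positive weights can move only the partial-progress score, never the strict resolved outcome, whereas omitting a checkpoint can flip a task from unresolved to resolved. If the paper's definitions match, the end-to-end rate would be sensitive to which checkpoints exist rather than to how they are weighted.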

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below and commit to revisions that directly respond to the concerns raised about documentation of the verification protocol.

Referee: [§3] §3 (Benchmark Construction) and the associated verification protocol: the manuscript does not provide quantitative details on how the weighted checkpoints were derived, how weights were assigned to sub-steps, inter-rater agreement for task grounding, or sensitivity analysis for missing critical state transitions. Because the central claim of <4% end-to-end success rests on these checkpoints accurately measuring strict completion rather than benchmark artifacts, this omission is load-bearing and requires explicit documentation or supplementary material.

Authors: We agree that the current manuscript provides only a high-level description of the weighted verification checkpoints in §3 and that additional quantitative details are required to substantiate the evaluation protocol. In the revised version we will expand §3 with a new subsection that (1) explains the derivation process, including the use of domain-expert review to identify critical state transitions and assign weights proportionally to their impact on task completion; (2) reports the exact weighting scheme and the rationale for each weight value; (3) presents inter-rater agreement statistics (Cohen’s κ) obtained from the three annotators who independently grounded each task and its checkpoints; and (4) includes a sensitivity analysis (moved to the appendix) that perturbs checkpoint weights and omits selected state transitions to show that the reported sub-4% end-to-end success rate remains stable. These additions will be supported by new tables and will not alter any experimental results. We believe the expanded documentation will eliminate concerns about benchmark artifacts while preserving the paper’s central claims.

revision: yes
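Cohen's κ, mentioned in the simulated response, is defined for a pair of raters, so with three annotators one common option is to report the mean of the pairwise values (Fleiss' κ is another). The sketch below takes the pairwise route as an illustration only; the labels and annotator data are made up, and nothing here reflects statistics from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(raters):
    """Average Cohen's kappa over all rater pairs."""
    pairs = list(combinations(raters, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical judgments on whether four checkpoints are valid.
annotator_1 = ["valid", "valid", "invalid", "invalid"]
annotator_2 = ["valid", "invalid", "invalid", "invalid"]
annotator_3 = ["valid", "valid", "invalid", "invalid"]

print(cohen_kappa(annotator_1, annotator_2))  # 0.5
print(mean_pairwise_kappa([annotator_1, annotator_2, annotator_3]))
```

Reporting the individual pairwise values alongside the mean would show whether one annotator diverges systematically from the other two.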

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are direct measurements

The paper introduces SaaS-Bench as a new collection of 106 tasks on 23 real SaaS systems and reports measured agent success rates (under 4% end-to-end) from direct experiments. No equations, fitted parameters, or derivations are present; the headline percentages are observations on the constructed benchmark rather than quantities forced by self-definition, renamed fits, or self-citation chains. Task design and weighted checkpoints are presented as independent engineering choices grounded in professional scenarios, with no reduction of the reported outcomes back to the inputs by construction. This is a standard empirical benchmark paper whose central claims remain falsifiable against external agent runs and do not rely on any load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes an empirical benchmark and evaluation protocol; it does not introduce new mathematical axioms, free parameters, or postulated entities beyond standard assumptions about agent capabilities.

axioms (1)

domain assumption: SaaS environments naturally involve dynamic system states, cross-application coordination, and long-horizon dependencies, which makes them suitable for CUA evaluation.
Stated in the abstract as justification for choosing SaaS platforms over existing simplified benchmarks.

pith-pipeline@v0.9.0 · 5826 in / 1179 out tokens · 52831 ms · 2026-05-20T18:59:13.163215+00:00 · methodology
