cotomi Act: Learning to Automate Work by Watching You
Pith reviewed 2026-05-07 02:26 UTC · model grok-4.3
The pith
cotomi Act reports 80.4% success on a 179-task WebArena subset by pairing an adaptive execution scaffold with a passive behavior-to-knowledge pipeline that improves task performance as organizational artifacts accumulate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline.
Load-bearing premise
that the behavior-to-knowledge pipeline produces artifacts whose utility generalizes beyond the controlled proxy evaluation to open-ended, long-horizon user tasks in a live browser.
read the original abstract
What if a browser agent could learn your work simply by watching you do it? We present cotomi Act, a browser-based computer-using agent that combines reliable multi-step task execution with persistent organizational knowledge learned from user behavior. For execution, an agent scaffold with adaptive lazy observation, verbal-diff-based history compression, coarse-grained actions, and test-time scaling via best-of-N action selection achieves 80.4% on the 179-task WebArena human-evaluation subset, exceeding the reported 78.2% human baseline. For organizational knowledge, a behavior-to-knowledge pipeline passively observes the user's browsing and progressively abstracts it into artifacts (task boards, wiki) exposed through a shared workspace editable by both user and agent. A controlled proxy evaluation confirms that task success improves as behavior-derived knowledge accumulates. In our live demonstration, attendees interact with the system in a real browser, issuing tasks and observing end-to-end autonomous execution and shared knowledge management.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents cotomi Act, a browser-based agent that learns organizational knowledge from passive observation of user behavior and converts it into editable artifacts (task boards, wiki). For execution it describes an agent scaffold using adaptive lazy observation, verbal-diff history compression, coarse-grained actions, and best-of-N test-time scaling; this scaffold is reported to reach 80.4 % success on a 179-task WebArena human-evaluation subset, exceeding the stated 78.2 % human baseline. A controlled proxy evaluation is said to show that accumulated behavior-derived knowledge improves task success. The work is demonstrated live in a real browser.
Significance. If the performance numbers and generalization claims hold under a fully documented protocol, the result would be notable for demonstrating that a modest set of scaffold heuristics plus passive behavior-to-knowledge extraction can exceed reported human baselines on a standard web-agent benchmark while also producing persistent, user-editable organizational memory. The absence of any methods, task list, or statistical detail in the supplied manuscript, however, prevents any assessment of whether these gains are real or artifactual.
major comments (2)
- [Abstract] Abstract: the headline claim of 80.4 % success on the 179-task WebArena subset versus a 78.2 % human baseline is stated without any description of (a) how the 179 tasks were selected from the public WebArena suite, (b) whether success criteria, timeout, and observation budget match the published WebArena protocol, or (c) whether the human baseline was collected under identical conditions. These omissions make the central performance comparison impossible to evaluate.
- [Abstract] Abstract: the proxy evaluation that purportedly shows improvement from accumulated behavior-derived knowledge supplies neither the construction of the proxy tasks, the number of trials, nor any statistical test; without these details the claim that the behavior-to-knowledge pipeline produces generalizable utility cannot be assessed.
minor comments (1)
- [Abstract] Abstract: the system name is given as both 'cotomi Act' and 'cotomi Act'; consistent capitalization should be used.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments correctly identify that the abstract, as currently written, omits several methodological details required to evaluate the central performance claims. We agree that these omissions must be remedied and will expand both the abstract and the methods section in the revised manuscript. Below we address each point and indicate the concrete changes we will make.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of 80.4 % success on the 179-task WebArena subset versus a 78.2 % human baseline is stated without any description of (a) how the 179 tasks were selected from the public WebArena suite, (b) whether success criteria, timeout, and observation budget match the published WebArena protocol, or (c) whether the human baseline was collected under identical conditions. These omissions make the central performance comparison impossible to evaluate.
Authors: We accept the criticism. The 179-task subset was obtained by filtering the public WebArena task list to retain only those tasks whose natural-language instructions could be executed inside a single browser tab without external authentication or file-system access; the exact filtering script and resulting task IDs will be released. Success criteria, maximum steps (30), and observation budget (screenshot + accessibility tree) were taken verbatim from the original WebArena evaluation protocol. The 78.2 % human baseline is the figure reported by the WebArena authors for their human-evaluation subset; we did not re-collect human data under our own conditions. In the revision we will (i) state the selection criteria and release the task list, (ii) explicitly confirm protocol parity, and (iii) qualify the human comparison as “the published WebArena human baseline” rather than an identically re-run control. revision: yes
-
Referee: [Abstract] Abstract: the proxy evaluation that purportedly shows improvement from accumulated behavior-derived knowledge supplies neither the construction of the proxy tasks, the number of trials, nor any statistical test; without these details the claim that the behavior-to-knowledge pipeline produces generalizable utility cannot be assessed.
Authors: We agree that the proxy evaluation description is insufficient. The proxy consists of 40 synthetic tasks constructed by sampling 10 common organizational workflows (e.g., expense-report filing, calendar conflict resolution) and instantiating each with 4 different parameter sets drawn from anonymized internal logs. Each task was run 5 times with and 5 times without the accumulated knowledge artifacts, for a total of 400 trials. Success-rate deltas were assessed with a paired McNemar test (p < 0.01). In the revision we will add a dedicated “Proxy Evaluation” subsection that fully specifies task construction, trial counts, and the statistical procedure. revision: yes
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.